fread crashes when reading large file with binary data #1895

Closed
tdhock opened this issue Nov 1, 2016 · 9 comments

@tdhock tdhock commented Nov 1, 2016

Hey @arunsrinivasan, I saw you were investigating one of the related issues (#1464, #1183, #1119) where fread crashes R. This issue happens on my laptop; its CPU, compiler, and data.table build output are as follows:

tdhock@recycled:~/datatable-bug(master)$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Pentium(R) Dual  CPU  T2390  @ 1.86GHz
stepping	: 13
microcode	: 0xa3
cpu MHz		: 800.000
cache size	: 1024 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm
bogomips	: 3724.24
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Pentium(R) Dual  CPU  T2390  @ 1.86GHz
stepping	: 13
microcode	: 0xa3
cpu MHz		: 1867.000
cache size	: 1024 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fdiv_bug	: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm
bogomips	: 3724.24
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

tdhock@recycled:~/datatable-bug(master)$
tdhock@recycled:~/PeakSegFPOP(robust-check)$ gcc --version
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

tdhock@recycled:~/PeakSegFPOP(robust-check)$
> devtools::install_github("Rdatatable/data.table")
Downloading GitHub repo Rdatatable/data.table@master
from URL https://api.github.com/repos/Rdatatable/data.table/zipball/master
Installing data.table
'/home/tdhock/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore  \
  --quiet CMD INSTALL  \
  '/tmp/RtmpTFnLFV/devtoolsd4b4ad69d38/Rdatatable-data.table-2b092fb'  \
  --library='/home/tdhock/lib/R/library' --install-tests 

* installing *source* package ‘data.table’ ...
** libs
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c assign.c -o assign.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c between.c -o between.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c bmerge.c -o bmerge.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c chmatch.c -o chmatch.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c dogroups.c -o dogroups.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fastmean.c -o fastmean.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fcast.c -o fcast.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fmelt.c -o fmelt.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c forder.c -o forder.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c frank.c -o frank.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fread.c -o fread.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fsort.c -o fsort.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fwrite.c -o fwrite.o
In file included from /usr/include/string.h:640:0,
                 from /home/tdhock/lib/R/include/R_ext/RS.h:31,
                 from /home/tdhock/lib/R/include/R.h:75,
                 from data.table.h:1,
                 from fwrite.c:1:
In function ‘memset’,
    inlined from ‘traceAccuracy’ at fwrite.c:78:11:
/usr/include/i386-linux-gnu/bits/string3.h:84:3: warning: call to __builtin___memset_chk will always overflow destination buffer [enabled by default]
   return __builtin___memset_chk (__dest, __ch, __len, __bos0 (__dest));
   ^
In function ‘memset’,
    inlined from ‘traceAccuracy’ at fwrite.c:79:11:
/usr/include/i386-linux-gnu/bits/string3.h:84:3: warning: call to __builtin___memset_chk will always overflow destination buffer [enabled by default]
   return __builtin___memset_chk (__dest, __ch, __len, __bos0 (__dest));
   ^
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c gsumm.c -o gsumm.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c ijoin.c -o ijoin.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c init.c -o init.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c inrange.c -o inrange.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c openmp-utils.c -o openmp-utils.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c quickselect.c -o quickselect.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c rbindlist.c -o rbindlist.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c reorder.c -o reorder.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c shift.c -o shift.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c subset.c -o subset.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c transpose.c -o transpose.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c uniqlist.c -o uniqlist.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c vecseq.c -o vecseq.o
gcc -std=gnu99 -I/home/tdhock/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c wrappers.c -o wrappers.o
gcc -std=gnu99 -shared -L/usr/local/lib -o data.table.so assign.o between.o bmerge.o chmatch.o dogroups.o fastmean.o fcast.o fmelt.o forder.o frank.o fread.o fsort.o fwrite.o gsumm.o ijoin.o init.o inrange.o openmp-utils.o quickselect.o rbindlist.o reorder.o shift.o subset.o transpose.o uniqlist.o vecseq.o wrappers.o -fopenmp
mv data.table.so datatable.so
if [ "" != "Windows_NT" ] && [ `uname -s` = 'Darwin' ]; then install_name_tool -id datatable.so datatable.so; fi
installing to /home/tdhock/lib/R/library/data.table/libs
** R
** inst
** tests
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (data.table)
> 

The MRE that crashes R on my laptop is in this gist: https://gist.github.com/tdhock/67f8507fee522343cc813a2affcb9d37#file-crash-r

devtools::source_gist("67f8507fee522343cc813a2affcb9d37")

which gives me the following output.

tdhock@recycled:~/datatable-bug(master)$ R --vanilla < crash.R

R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: i686-pc-linux-gnu (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ### Write down what package versions work with your R code, and
> ### attempt to download and load those packages. The first argument is
> ### the version of R that you used, e.g. "3.0.2" and then the rest of
> ### the arguments are package versions. For
> ### CRAN/Bioconductor/R-Forge/etc packages, write
> ### e.g. RColorBrewer="1.0.5" and if RColorBrewer is not installed
> ### then we use install.packages to get the most recent version, and
> ### warn if the installed version is not the indicated version. For
> ### GitHub packages, write "user/repo@commit"
> ### e.g. "tdhock/animint@f877163cd181f390de3ef9a38bb8bdd0396d08a4" and
> ### we use install_github to get it, if necessary.
> works_with_R <- function(Rvers,...){
+   pkg_ok_have <- function(pkg,ok,have){
+     stopifnot(is.character(ok))
+     if(!as.character(have) %in% ok){
+       warning("works with ",pkg," version ",
+               paste(ok,collapse=" or "),
+               ", have ",have)
+     }
+   }
+   pkg_ok_have("R",Rvers,getRversion())
+   pkg.vers <- list(...)
+   for(pkg.i in seq_along(pkg.vers)){
+     vers <- pkg.vers[[pkg.i]]
+     pkg <- if(is.null(names(pkg.vers))){
+       ""
+     }else{
+       names(pkg.vers)[[pkg.i]]
+     }
+     if(pkg == ""){# Then it is from GitHub.
+       ## suppressWarnings is quieter than quiet.
+       if(!suppressWarnings(require(requireGitHub))){
+         ## If requireGitHub is not available, then install it using
+         ## devtools.
+         if(!suppressWarnings(require(devtools))){
+           install.packages("devtools")
+           require(devtools)
+         }
+         install_github("tdhock/requireGitHub")
+         require(requireGitHub)
+       }
+       requireGitHub(vers)
+     }else{# it is from a CRAN-like repos.
+       if(!suppressWarnings(require(pkg, character.only=TRUE))){
+         install.packages(pkg)
+       }
+       pkg_ok_have(pkg, vers, packageVersion(pkg))
+       library(pkg, character.only=TRUE)
+     }
+   }
+ }
> works_with_R(
+   "3.3.1",
+   httr="1.0.0",
+   "Rdatatable/data.table@2b092fbae4380acac66baf923436fe796ec823d8")
Loading required package: httr
Loading required package: requireGitHub
Loading required package: data.table
> devtools::session_info()
Session info -------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.1 (2016-06-21)
 system   i686, linux-gnu             
 ui       X11                         
 language en_US                       
 collate  en_US.UTF-8                 
 tz       posixrules                  
 date     2016-10-31                  

Packages -----------------------------------------------------------------------
 package       * version  date       source                                
 data.table    * 1.9.7    2016-10-31 Github (Rdatatable/data.table@2b092fb)
 devtools        1.11.1   2016-04-21 CRAN (R 3.2.2)                        
 digest          0.6.10   2016-08-02 CRAN (R 3.2.2)                        
 httr          * 1.0.0    2015-06-25 CRAN (R 3.2.2)                        
 magrittr        1.5      2014-11-22 CRAN (R 3.2.2)                        
 memoise         1.0.0    2016-01-29 CRAN (R 3.2.2)                        
 R6              2.1.1    2015-08-19 CRAN (R 3.2.2)                        
 requireGitHub * 2014.4.4 2016-08-13 local                                 
 stringi         1.1.2    2016-10-01 CRAN (R 3.3.1)                        
 stringr         1.1.0    2016-08-19 CRAN (R 3.3.1)                        
 withr           1.0.1    2016-02-04 CRAN (R 3.2.2)                        
> 
> download.xzcat.fread <- function(u){
+   request <- GET(u)
+   stop_for_status(request)
+   f <- sub(".*/", "", u)
+   writeBin(content(request), f)
+   cmd <- paste("xzcat", f)
+   fread(cmd, verbose=TRUE)
+ }
> 
> download.xzcat.fread("https://gist.github.com/tdhock/67f8507fee522343cc813a2affcb9d37/raw/7ca64af99e0fdb5c8e498d3c247fb364cb7364cc/small_fread_ok.txt.xz")
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.201169 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... '\t'
Detected 5 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: chr10	3306
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol: 6000000 (including 0 at the end)
Count of sep: 23999997
nrow = MIN( nsep [23999997] / (ncol [5] -1), neol [6000000] - endblanks [0] ) = 5999999
Type codes (point  0): 41143
Type codes (point  1): 41143
Type codes (point  2): 41143
Type codes (point  3): 41143
Type codes (point  4): 41143
Type codes (point  5): 41143
Type codes (point  6): 41143
Type codes (point  7): 41143
Type codes (point  8): 41143
Type codes (point  9): 41143
Type codes (point 10): 41143
Type codes: 41143 (after applying colClasses and integer64)
Type codes: 41143 (after applying drop or select (if supplied)
Allocating 5 column slots (5 - 0 dropped)
Read 5999999 rows and 5 (of 5) columns from 0.201 GB file in 00:00:06
Read 5999999 rows. Exactly what was estimated and allocated up front
   0.049s (  1%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   1.567s ( 26%) Count rows (wc -l)
   0.001s (  0%) Column type detection (first, middle and last 5 rows)
   0.399s (  7%) Allocation of 5999999x5 result (xMB) in RAM
   3.945s ( 66%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.022s (  0%) Changing na.strings to NA
   5.983s        Total
            V1       V2       V3         V4  V5
      1: chr10 33061100 33061100       peak Inf
      2: chr10 33061100 33061100 background Inf
      3: chr10 33061100 33061100       peak Inf
      4: chr10 33061100 33061100 background Inf
      5: chr10 33061100 33061100       peak Inf
     ---                                       
5999995: chr10 33061100 33061100       peak Inf
5999996: chr10 33061100 33061100 background Inf
5999997: chr10 33061100 33061100       peak Inf
5999998: chr10 33061100 33061100 background Inf
5999999: chr10 33061100 33061100       peak Inf
Warning message:
In fread(cmd, verbose = TRUE) :
  Stopped reading at empty line 6000000 but text exists afterwards (discarded): chr10	
> 
> download.xzcat.fread("https://gist.github.com/tdhock/67f8507fee522343cc813a2affcb9d37/raw/7ca64af99e0fdb5c8e498d3c247fb364cb7364cc/big_fread_crashes.txt.xz")
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.234697 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... '\t'
Detected 5 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: chr10	3306
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol: 7000000 (including 0 at the end)
Count of sep: 27999997
nrow = MIN( nsep [27999997] / (ncol [5] -1), neol [7000000] - endblanks [0] ) = 6999999
Type codes (point  0): 41143
Type codes (point  1): 41143
Type codes (point  2): 41143
Type codes (point  3): 41143
Type codes (point  4): 41143
Type codes (point  5): 41143
Type codes (point  6): 41143
Type codes (point  7): 41143
Type codes (point  8): 41143

 *** caught segfault ***
address 0x82bed560, cause 'memory not mapped'

Traceback:
 1: fread(cmd, verbose = TRUE)
 2: download.xzcat.fread("https://gist.github.com/tdhock/67f8507fee522343cc813a2affcb9d37/raw/7ca64af99e0fdb5c8e498d3c247fb364cb7364cc/big_fread_crashes.txt.xz")
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault (core dumped)
tdhock@recycled:~/datatable-bug(master)$ 

I tried my best to create an MRE, but when I run it on a different computer (with the same version of data.table), it does not crash R. Any ideas? Or is this a bug in my compiler?

[thocking@lg-1r17-n04 67f8507fee522343cc813a2affcb9d37]$ cat /proc/cpuinfo |head -20
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping	: 7
microcode	: 1808
cpu MHz		: 2000.029
cache size	: 20480 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
[thocking@lg-1r17-n04 67f8507fee522343cc813a2affcb9d37]$ gcc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-11)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[thocking@lg-1r17-n04 67f8507fee522343cc813a2affcb9d37]$ Rscript -e 'devtools::install_github("Rdatatable/data.table")'
Downloading GitHub repo Rdatatable/data.table@master
from URL https://api.github.com/repos/Rdatatable/data.table/zipball/master
Installing data.table
'/home/thocking/lib64/R/bin/R' --no-site-file --no-environ --no-save  \
  --no-restore --quiet CMD INSTALL  \
  '/tmp/Rtmp3B4R7j/devtools61f4293cf614/Rdatatable-data.table-2b092fb'  \
  --library='/sb/home/thocking/lib64/R/library' --install-tests 

* installing *source* package ‘data.table’ ...
** libs
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c assign.c -o assign.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c between.c -o between.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c bmerge.c -o bmerge.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c chmatch.c -o chmatch.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c dogroups.c -o dogroups.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fastmean.c -o fastmean.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fcast.c -o fcast.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fmelt.c -o fmelt.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c forder.c -o forder.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c frank.c -o frank.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fread.c -o fread.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fsort.c -o fsort.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fwrite.c -o fwrite.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c gsumm.c -o gsumm.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c ijoin.c -o ijoin.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c init.c -o init.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c inrange.c -o inrange.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c openmp-utils.c -o openmp-utils.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c quickselect.c -o quickselect.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c rbindlist.c -o rbindlist.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c reorder.c -o reorder.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c shift.c -o shift.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c subset.c -o subset.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c transpose.c -o transpose.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c uniqlist.c -o uniqlist.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c vecseq.c -o vecseq.o
gcc -std=gnu99 -I/home/thocking/lib64/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c wrappers.c -o wrappers.o
gcc -std=gnu99 -shared -L/usr/local/lib64 -o data.table.so assign.o between.o bmerge.o chmatch.o dogroups.o fastmean.o fcast.o fmelt.o forder.o frank.o fread.o fsort.o fwrite.o gsumm.o ijoin.o init.o inrange.o openmp-utils.o quickselect.o rbindlist.o reorder.o shift.o subset.o transpose.o uniqlist.o vecseq.o wrappers.o -fopenmp
mv data.table.so datatable.so
if [ "" != "Windows_NT" ] && [ `uname -s` = 'Darwin' ]; then install_name_tool -id datatable.so datatable.so; fi
installing to /sb/home/thocking/lib64/R/library/data.table/libs
** R
** inst
** tests
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (data.table)
[thocking@lg-1r17-n04 67f8507fee522343cc813a2affcb9d37]$ Rscript crash.R 
Loading required package: httr
Loading required package: requireGitHub

Attaching package: ‘requireGitHub’

The following object is masked _by_ ‘.GlobalEnv’:

    str_match_perl

Loading required package: data.table
Warning message:
In pkg_ok_have("R", Rvers, getRversion()) :
  works with R version 3.3.1, have 3.2.2
Session info -------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.2 (2015-08-14)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       <NA>                        
 date     2016-10-31                  

Packages -----------------------------------------------------------------------
 package       * version  date       source                                
 data.table    * 1.9.7    2016-11-01 Github (Rdatatable/data.table@2b092fb)
 devtools        1.12.0   2016-06-24 CRAN (R 3.2.2)                        
 digest          0.6.8    2014-12-31 CRAN (R 3.2.2)                        
 httr          * 1.0.0    2015-06-25 CRAN (R 3.2.2)                        
 magrittr        1.5      2014-11-22 CRAN (R 3.2.2)                        
 memoise         1.0.0    2016-01-29 CRAN (R 3.2.2)                        
 R6              2.1.1    2015-08-19 CRAN (R 3.2.2)                        
 requireGitHub * 2014.4.4 2016-05-11 Github (tdhock/requireGitHub@2a4c30a) 
 stringi         1.0-1    2015-10-22 CRAN (R 3.2.2)                        
 stringr         1.0.0    2015-04-30 CRAN (R 3.2.2)                        
 withr           1.0.2    2016-06-20 CRAN (R 3.2.2)                        
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.201169 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... '\t'
Detected 5 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: chr10	3306
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol: 6000000 (including 0 at the end)
Count of sep: 23999997
nrow = MIN( nsep [23999997] / (ncol [5] -1), neol [6000000] - endblanks [0] ) = 5999999
Type codes (point  0): 41143
Type codes (point  1): 41143
Type codes (point  2): 41143
Type codes (point  3): 41143
Type codes (point  4): 41143
Type codes (point  5): 41143
Type codes (point  6): 41143
Type codes (point  7): 41143
Type codes (point  8): 41143
Type codes (point  9): 41143
Couldn't guess column types from test point 10
Type codes: 41143 (after applying colClasses and integer64)
Type codes: 41143 (after applying drop or select (if supplied)
Allocating 5 column slots (5 - 0 dropped)
Read 5999999 rows and 5 (of 5) columns from 0.201 GB file in 00:00:04
Read 5999999 rows. Exactly what was estimated and allocated up front
   0.040s (  1%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.280s (  9%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.280s (  9%) Allocation of 5999999x5 result (xMB) in RAM
   2.680s ( 81%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.010s (  0%) Changing na.strings to NA
   3.290s        Total
            V1       V2       V3         V4  V5
      1: chr10 33061100 33061100       peak Inf
      2: chr10 33061100 33061100 background Inf
      3: chr10 33061100 33061100       peak Inf
      4: chr10 33061100 33061100 background Inf
      5: chr10 33061100 33061100       peak Inf
     ---                                       
5999995: chr10 33061100 33061100       peak Inf
5999996: chr10 33061100 33061100 background Inf
5999997: chr10 33061100 33061100       peak Inf
5999998: chr10 33061100 33061100 background Inf
5999999: chr10 33061100 33061100       peak Inf
Warning message:
In fread(cmd, verbose = TRUE) :
  Stopped reading at empty line 6000000 but text exists afterwards (discarded): chr10	
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.234697 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... '\t'
Detected 5 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: chr10	3306
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol: 7000000 (including 0 at the end)
Count of sep: 27999997
nrow = MIN( nsep [27999997] / (ncol [5] -1), neol [7000000] - endblanks [0] ) = 6999999
Type codes (point  0): 41143
Type codes (point  1): 41143
Type codes (point  2): 41143
Type codes (point  3): 41143
Type codes (point  4): 41143
Type codes (point  5): 41143
Type codes (point  6): 41143
Type codes (point  7): 41143
Type codes (point  8): 41143
Type codes (point  9): 41143
Couldn't guess column types from test point 10
Type codes: 41143 (after applying colClasses and integer64)
Type codes: 41143 (after applying drop or select (if supplied)
Allocating 5 column slots (5 - 0 dropped)
Read 6999999 rows and 5 (of 5) columns from 0.235 GB file in 00:00:04
Read 6999999 rows. Exactly what was estimated and allocated up front
   0.050s (  1%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.330s (  9%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.430s ( 11%) Allocation of 6999999x5 result (xMB) in RAM
   3.040s ( 79%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.010s (  0%) Changing na.strings to NA
   3.860s        Total
            V1       V2       V3         V4  V5
      1: chr10 33061100 33061100       peak Inf
      2: chr10 33061100 33061100 background Inf
      3: chr10 33061100 33061100       peak Inf
      4: chr10 33061100 33061100 background Inf
      5: chr10 33061100 33061100       peak Inf
     ---                                       
6999995: chr10 33061100 33061100       peak Inf
6999996: chr10 33061100 33061100 background Inf
6999997: chr10 33061100 33061100       peak Inf
6999998: chr10 33061100 33061100 background Inf
6999999: chr10 33061100 33061100       peak Inf
Warning message:
In fread(cmd, verbose = TRUE) :
  Stopped reading at empty line 7000000 but text exists afterwards (discarded): chr10	
[thocking@lg-1r17-n04 67f8507fee522343cc813a2affcb9d37]$ 
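
One difference visible in the session info above is that the laptop where fread crashes runs a 32-bit R build (i686), while the other machine runs x86_64. Whether that difference has anything to do with the segfault is not established here. The sketch below only records the platform details with base R so they can be compared across machines; `platform_summary` is a hypothetical helper name, not part of data.table.

```r
# Hypothetical helper: summarise the properties of this R build that
# differ between the two machines in the logs above.
platform_summary <- function() {
  list(
    arch = R.version$arch,                        # e.g. "i686" or "x86_64"
    pointer.bits = 8L * .Machine$sizeof.pointer,  # 32 or 64
    r.version = as.character(getRversion())
  )
}

str(platform_summary())
```

Running this on each machine next to `devtools::session_info()` gives a compact record of which build produced which result.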

@tdhock tdhock commented Nov 1, 2016

To clarify, the gist contains two files:

https://gist.github.com/tdhock/67f8507fee522343cc813a2affcb9d37/raw/8465f80bbee4fd26f01069a81fd19bc71013c71a/big_fread_crashes.txt.xz is the file that crashes fread

https://gist.github.com/tdhock/67f8507fee522343cc813a2affcb9d37/raw/8465f80bbee4fd26f01069a81fd19bc71013c71a/small_fread_ok.txt.xz is a control file -- it is slightly smaller than the other one, and it does not crash fread (even on my laptop)
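
The verbose output reports a "Count of eol" and a "Count of sep" for each file and derives the row estimate from them. As a cross-check that does not go through fread, the sketch below recounts newline and tab bytes in an xz-compressed file with base R only. `count_eol_sep` is a hypothetical helper, and `endblanks` is assumed to be 0 as in the logs. fread's own counting rules are not reproduced here, so small differences from its numbers are possible.

```r
# Hypothetical helper (not part of data.table): count "\n" and "\t" bytes
# in an xz-compressed file, reading it in chunks to keep memory use low.
count_eol_sep <- function(xz.path, ncol = 5L, chunk.bytes = 1e6) {
  con <- xzfile(xz.path, open = "rb")
  on.exit(close(con))
  n.eol <- 0
  n.sep <- 0
  repeat {
    bytes <- readBin(con, what = "raw", n = chunk.bytes)
    if (length(bytes) == 0L) break                # end of file
    n.eol <- n.eol + sum(bytes == as.raw(0x0a))   # newline bytes
    n.sep <- n.sep + sum(bytes == as.raw(0x09))   # tab bytes
  }
  # Formula from the verbose line
  # "nrow = MIN( nsep / (ncol - 1), neol - endblanks )", with endblanks = 0.
  list(eol = n.eol, sep = n.sep,
       nrow.estimate = min(n.sep %/% (ncol - 1L), n.eol))
}
```

With the counts logged for the larger file, the same formula gives `min(27999997 %/% 4, 7000000)`, which is 6999999 and matches the `nrow` line in the verbose output.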

@jangorecki jangorecki commented Nov 1, 2016

This probably has nothing to do with the crash on binary data in a CSV file, but it is recommended to install data.table from our package repository at https://rdatatable.github.io/data.table because master is not guaranteed to pass unit tests. The package is published to the repository only if all tests pass.
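
A minimal sketch of installing from that repository instead of from master, using base R's `install.packages`. The `type = "source"` argument is an assumption (it needs a working compiler toolchain), and `install_dt_devel` is a hypothetical wrapper name. The install is left commented out so that sourcing the snippet has no side effects.

```r
# Repository URL as given in the comment above.
dt.repo <- "https://rdatatable.github.io/data.table"

# Hypothetical wrapper: install data.table from the package repository
# rather than from the GitHub master branch.
install_dt_devel <- function(repo = dt.repo) {
  # type = "source" is an assumption; it compiles the package locally.
  install.packages("data.table", repos = repo, type = "source")
}

# install_dt_devel()            # uncomment to install
# packageVersion("data.table")  # then confirm which version is loaded
```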

@tdhock tdhock commented Nov 1, 2016

Another computer where the MRE does not result in a crash:

thocking@silene:~/PeakSegFPOP(robust-check)$ R --vanilla

R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ### Write down what package versions work with your R code, and
> ### attempt to download and load those packages. The first argument is
> ### the version of R that you used, e.g. "3.0.2" and then the rest of
> ### the arguments are package versions. For
> ### CRAN/Bioconductor/R-Forge/etc packages, write
> ### e.g. RColorBrewer="1.0.5" and if RColorBrewer is not installed
> ### then we use install.packages to get the most recent version, and
> ### warn if the installed version is not the indicated version. For
> ### GitHub packages, write "user/repo@commit"
> ### e.g. "tdhock/animint@f877163cd181f390de3ef9a38bb8bdd0396d08a4" and
> ### we use install_github to get it, if necessary.
> works_with_R <- function(Rvers,...){
+   pkg_ok_have <- function(pkg,ok,have){
+     stopifnot(is.character(ok))
+     if(!as.character(have) %in% ok){
+       warning("works with ",pkg," version ",
+               paste(ok,collapse=" or "),
+               ", have ",have)
+     }
+   }
+   pkg_ok_have("R",Rvers,getRversion())
+   pkg.vers <- list(...)
+   for(pkg.i in seq_along(pkg.vers)){
+     vers <- pkg.vers[[pkg.i]]
+     pkg <- if(is.null(names(pkg.vers))){
+       ""
+     }else{
+       names(pkg.vers)[[pkg.i]]
+     }
+     if(pkg == ""){# Then it is from GitHub.
+       ## suppressWarnings is quieter than quiet.
+       if(!suppressWarnings(require(requireGitHub))){
+         ## If requireGitHub is not available, then install it using
+         ## devtools.
+         if(!suppressWarnings(require(devtools))){
+           install.packages("devtools")
+           require(devtools)
+         }
+         install_github("tdhock/requireGitHub")
+         require(requireGitHub)
+       }
+       requireGitHub(vers)
+     }else{# it is from a CRAN-like repos.
+       if(!suppressWarnings(require(pkg, character.only=TRUE))){
+         install.packages(pkg)
+       }
+       pkg_ok_have(pkg, vers, packageVersion(pkg))
+       library(pkg, character.only=TRUE)
+     }
+   }
+ }
> works_with_R(
+   "3.3.1",
+   httr="1.0.0",
+   "Rdatatable/data.table@2b092fbae4380acac66baf923436fe796ec823d8")
Loading required package: httr
Loading required package: requireGitHub
Loading required package: devtools
Downloading GitHub repo Rdatatable/data.table@2b092fbae4380acac66baf923436fe796ec823d8
from URL https://api.github.com/repos/Rdatatable/data.table/zipball/2b092fbae4380acac66baf923436fe796ec823d8
Installing data.table
'/home/thocking/lib/R/bin/R' --no-site-file --no-environ --no-save  \
  --no-restore --quiet CMD INSTALL  \
  '/tmp/RtmpqhSzAX/devtools4b62b28b40e/Rdatatable-data.table-2b092fb'  \
  --library='/home/thocking/lib/R/library' --install-tests 

* installing *source* package ‘data.table’ ...
** libs
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c assign.c -o assign.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c between.c -o between.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c bmerge.c -o bmerge.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c chmatch.c -o chmatch.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c dogroups.c -o dogroups.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fastmean.c -o fastmean.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fcast.c -o fcast.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fmelt.c -o fmelt.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c forder.c -o forder.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c frank.c -o frank.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fread.c -o fread.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fsort.c -o fsort.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c fwrite.c -o fwrite.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c gsumm.c -o gsumm.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c ijoin.c -o ijoin.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c init.c -o init.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c inrange.c -o inrange.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c openmp-utils.c -o openmp-utils.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c quickselect.c -o quickselect.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c rbindlist.c -o rbindlist.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c reorder.c -o reorder.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c shift.c -o shift.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c subset.c -o subset.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c transpose.c -o transpose.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c uniqlist.c -o uniqlist.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c vecseq.c -o vecseq.o
gcc -std=gnu99 -I/home/thocking/lib/R/include -DNDEBUG  -I/usr/local/include   -fopenmp -fpic  -g -O2  -c wrappers.c -o wrappers.o
gcc -std=gnu99 -shared -L/usr/local/lib -o data.table.so assign.o between.o bmerge.o chmatch.o dogroups.o fastmean.o fcast.o fmelt.o forder.o frank.o fread.o fsort.o fwrite.o gsumm.o ijoin.o init.o inrange.o openmp-utils.o quickselect.o rbindlist.o reorder.o shift.o subset.o transpose.o uniqlist.o vecseq.o wrappers.o -fopenmp
mv data.table.so datatable.so
if [ "" != "Windows_NT" ] && [ `uname -s` = 'Darwin' ]; then install_name_tool -id datatable.so datatable.so; fi
installing to /home/thocking/lib/R/library/data.table/libs
** R
** inst
** tests
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (data.table)
Loading required package: data.table
data.table 1.9.7 IN DEVELOPMENT built 2016-11-01 13:22:59 UTC
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
Warning message:
In pkg_ok_have(pkg, vers, packageVersion(pkg)) :
  works with httr version 1.0.0, have 1.2.1
> devtools::session_info()
Session info -------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.1 (2016-06-21)
 system   x86_64, linux-gnu           
 ui       X11                         
 language en_CA:en                    
 collate  en_CA.UTF-8                 
 tz       <NA>                        
 date     2016-11-01                  

Packages -----------------------------------------------------------------------
 package       * version     date       source                                
 curl            1.1         2016-07-26 CRAN (R 3.2.3)                        
 data.table    * 1.9.7       2016-11-01 Github (Rdatatable/data.table@2b092fb)
 devtools      * 1.12.0.9000 2016-08-12 Github (hadley/devtools@565ac15)      
 digest          0.6.10      2016-08-02 CRAN (R 3.2.3)                        
 git2r           0.15.0      2016-05-11 CRAN (R 3.2.3)                        
 httr          * 1.2.1       2016-07-03 CRAN (R 3.2.3)                        
 knitr           1.12.3      2016-01-22 CRAN (R 3.2.3)                        
 memoise         1.0.0       2016-01-29 CRAN (R 3.2.3)                        
 R6              2.1.2       2016-01-26 CRAN (R 3.2.3)                        
 requireGitHub * 2014.4.4    2016-02-15 Github (tdhock/requireGitHub@2a4c30a) 
 withr           1.0.2       2016-06-20 CRAN (R 3.2.3)                        
> 
> download.xzcat.fread <- function(u){
+   request <- GET(u)
+   stop_for_status(request)
+   f <- sub(".*/", "", u)
+   writeBin(content(request), f)
+   cmd <- paste("xzcat", f)
+   fread(cmd, verbose=TRUE)
+ }
> 
> download.xzcat.fread("https://gist.github.com/tdhock/67f8507fee522343cc813a2affcb9d37/raw/8465f80bbee4fd26f01069a81fd19bc71013c71a/small_fread_ok.txt.xz")
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.201169 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... '\t'
Detected 5 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: chr10    3306
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol: 6000000 (including 0 at the end)
Count of sep: 23999997
nrow = MIN( nsep [23999997] / (ncol [5] -1), neol [6000000] - endblanks [0] ) = 5999999
Type codes (point  0): 41143
Type codes (point  1): 41143
Type codes (point  2): 41143
Type codes (point  3): 41143
Type codes (point  4): 41143
Type codes (point  5): 41143
Type codes (point  6): 41143
Type codes (point  7): 41143
Type codes (point  8): 41143
Type codes (point  9): 41143
Couldn't guess column types from test point 10
Type codes: 41143 (after applying colClasses and integer64)
Type codes: 41143 (after applying drop or select (if supplied)
Allocating 5 column slots (5 - 0 dropped)
Read 5999999 rows and 5 (of 5) columns from 0.201 GB file in 00:00:03
Read 5999999 rows. Exactly what was estimated and allocated up front
   0.020s (  1%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.300s ( 11%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.330s ( 12%) Allocation of 5999999x5 result (xMB) in RAM
   2.110s ( 76%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.010s (  0%) Changing na.strings to NA
   2.770s        Total
            V1       V2       V3         V4  V5
      1: chr10 33061100 33061100       peak Inf
      2: chr10 33061100 33061100 background Inf
      3: chr10 33061100 33061100       peak Inf
      4: chr10 33061100 33061100 background Inf
      5: chr10 33061100 33061100       peak Inf
     ---                                       
5999995: chr10 33061100 33061100       peak Inf
5999996: chr10 33061100 33061100 background Inf
5999997: chr10 33061100 33061100       peak Inf
5999998: chr10 33061100 33061100 background Inf
5999999: chr10 33061100 33061100       peak Inf
Warning message:
In fread(cmd, verbose = TRUE) :
  Stopped reading at empty line 6000000 but text exists afterwards (discarded): chr10   
> 
> download.xzcat.fread("https://gist.github.com/tdhock/67f8507fee522343cc813a2affcb9d37/raw/8465f80bbee4fd26f01069a81fd19bc71013c71a/big_fread_crashes.txt.xz")
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.234697 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... '\t'
Detected 5 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: chr10    3306
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol: 7000000 (including 0 at the end)
Count of sep: 27999997
nrow = MIN( nsep [27999997] / (ncol [5] -1), neol [7000000] - endblanks [0] ) = 6999999
Type codes (point  0): 41143
Type codes (point  1): 41143
Type codes (point  2): 41143
Type codes (point  3): 41143
Type codes (point  4): 41143
Type codes (point  5): 41143
Type codes (point  6): 41143
Type codes (point  7): 41143
Type codes (point  8): 41143
Type codes (point  9): 41143
Couldn't guess column types from test point 10
Type codes: 41143 (after applying colClasses and integer64)
Type codes: 41143 (after applying drop or select (if supplied)
Allocating 5 column slots (5 - 0 dropped)
Read 6999999 rows and 5 (of 5) columns from 0.235 GB file in 00:00:04
Read 6999999 rows. Exactly what was estimated and allocated up front
   0.020s (  1%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.370s ( 11%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.390s ( 12%) Allocation of 6999999x5 result (xMB) in RAM
   2.450s ( 76%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.010s (  0%) Changing na.strings to NA
   3.240s        Total
            V1       V2       V3         V4  V5
      1: chr10 33061100 33061100       peak Inf
      2: chr10 33061100 33061100 background Inf
      3: chr10 33061100 33061100       peak Inf
      4: chr10 33061100 33061100 background Inf
      5: chr10 33061100 33061100       peak Inf
     ---                                       
6999995: chr10 33061100 33061100       peak Inf
6999996: chr10 33061100 33061100 background Inf
6999997: chr10 33061100 33061100       peak Inf
6999998: chr10 33061100 33061100 background Inf
6999999: chr10 33061100 33061100       peak Inf
Warning message:
In fread(cmd, verbose = TRUE) :
  Stopped reading at empty line 7000000 but text exists afterwards (discarded): chr10   
> 
thocking@silene:~/PeakSegFPOP(robust-check)$ gcc --version
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

thocking@silene:~/PeakSegFPOP(robust-check)$ cat /proc/cpuinfo |head -20
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Core(TM) i7 CPU         930  @ 2.80GHz
stepping    : 5
microcode   : 0xf
cpu MHz     : 1600.000
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dtherm tpr_shadow vnmi flexpriority ept vpid
thocking@silene:~/PeakSegFPOP(robust-check)$ 

Member Author

@tdhock tdhock commented Nov 1, 2016

On my laptop, which exhibits the crash, the only differences in the verbose output between the bad file and the control file are:

@@ -1,5 +1,5 @@
 Input contains no \n. Taking this to be a filename to open
-File opened, filesize is 0.201169 GB.
+File opened, filesize is 0.234697 GB.
 Memory mapping ... ok
 Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
 Positioned on line 1 after skip or autostart
@@ -8,9 +8,9 @@
 Detected 5 columns. Longest stretch was from line 1 to line 30
 Starting data input on line 1 (either column names or first row of data). First 10 characters: chr10   3306
 Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
-Count of eol: 6000000 (including 0 at the end)
-Count of sep: 23999997
-nrow = MIN( nsep [23999997] / (ncol [5] -1), neol [6000000] - endblanks [0] ) = 5999999
+Count of eol: 7000000 (including 0 at the end)
+Count of sep: 27999997
+nrow = MIN( nsep [27999997] / (ncol [5] -1), neol [7000000] - endblanks [0] ) = 6999999
 Type codes (point  0): 41143
 Type codes (point  1): 41143
 Type codes (point  2): 41143
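
The row estimate printed in the verbose output can be checked by hand. The sketch below is only an illustration of that arithmetic: `estimate_nrow` is a hypothetical helper name (not a data.table function), and it assumes the division in the logged formula is integer division, which is consistent with the logged results but not confirmed by the log itself.

```r
# Reproduce the row estimate shown in the verbose log:
#   nrow = MIN( nsep / (ncol - 1), neol - endblanks )
# Assumption: the division truncates to an integer (%/% in R).
estimate_nrow <- function(nsep, ncol, neol, endblanks = 0) {
  min(nsep %/% (ncol - 1), neol - endblanks)
}

# Counts copied from the verbose output of the two files.
control  <- estimate_nrow(nsep = 23999997, ncol = 5, neol = 6000000)
crashing <- estimate_nrow(nsep = 27999997, ncol = 5, neol = 7000000)

print(c(control = control, crashing = crashing))
```

Both values match the logged estimates (5999999 and 6999999). In each file the separator count is one more than 4 times the row count, which would be consistent with a truncated final line that contains a single tab.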

@mattdowle mattdowle added this to the v1.9.8 milestone Nov 7, 2016
@mattdowle mattdowle added this to the v1.9.10 milestone Nov 23, 2016
@mattdowle mattdowle removed this from the v1.9.8 milestone Nov 23, 2016
@mattdowle mattdowle added the High label Nov 23, 2016
Member Author

@tdhock tdhock commented Nov 24, 2016

By the way, if you are having a hard time reproducing this on one of your computers, I would be more than happy to help by testing on my laptop, where the crash is known to occur.

Member Author

@tdhock tdhock commented Aug 28, 2017

I checked my old example with R-3.4.1, and the old version of data.table mentioned in that gist, and it is still giving the same segfault.

I then checked with the newest data.table version from GitHub, and it is now working (no crash)! So congratulations, something you guys did fixed the crash I was having.

However, I was a bit surprised that fread did not give any warning about an unfinished line or binary data, since there is binary data on the last line of those files. For example, I run xzcat big_fread_crashes.txt.xz | nl | tail at the end of this script https://gist.github.com/tdhock/67f8507fee522343cc813a2affcb9d37#file-datatable-works-r which shows an incomplete line and binary data at the end of the file (line 7,000,000). But data.table reports no error or warning. Is this a problem?

Member Author

@tdhock tdhock commented Aug 28, 2017

Here is a screenshot which shows the binary data at the end of the file: https://gist.github.com/tdhock/67f8507fee522343cc813a2affcb9d37#file-big_binary_data_at_end-png

Contributor

@st-pasha st-pasha commented Nov 14, 2017

Checking with the latest version of data.table, I can confirm that both files can be read without a crash. The option fill=TRUE has to be provided though, because the last line is truncated.

As for the "binary data" at the end of the file(s) -- they are \0 bytes. Currently fread does not support inputs with embedded nul bytes. However, in this case, when a nulled-out section appears at the end of the file, the correct approach is simply to ignore it. So, almost accidentally, fread does the right thing here. Perhaps the only thing to improve is to provide a more explicit warning message (or verbose message).
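
For anyone who wants to check a file for a trailing run of \0 bytes before passing it to fread, here is a minimal base R sketch. It is not part of data.table; `count_trailing_nul` and the demo file are hypothetical names made up for illustration, and the function reads the whole (already decompressed) file into memory, so it is only meant for files that fit in RAM.

```r
# Count the run of NUL (0x00) bytes at the very end of a file.
# Hypothetical helper for illustration; not a data.table function.
count_trailing_nul <- function(path) {
  size <- file.size(path)
  bytes <- readBin(path, what = "raw", n = size)  # whole file as raw bytes
  non_nul <- which(bytes != as.raw(0))
  if (length(non_nul) == 0L) {
    return(length(bytes))  # empty file, or every byte is NUL
  }
  length(bytes) - max(non_nul)
}

# Demo file (made up): one complete row, a truncated row, then 4 NUL bytes.
demo <- tempfile(fileext = ".txt")
con <- file(demo, "wb")
writeBin(charToRaw("chr10\t1\t2\tpeak\tInf\nchr10\t"), con)
writeBin(as.raw(rep(0, 4)), con)
close(con)

n_nul <- count_trailing_nul(demo)
print(n_nul)
```

A nonzero count suggests the file ends with a nulled-out section like the one described above. The sketch only looks at the end of the file, so it says nothing about NUL bytes embedded elsewhere.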

@mattdowle mattdowle removed this from the Candidate milestone Nov 14, 2017
@mattdowle mattdowle added this to the v1.10.6 milestone Nov 14, 2017
Member Author

@tdhock tdhock commented Nov 16, 2017

great thanks
