For the same amount of data, reading a wide CSV (1,600 x 110,000) is 10 to 15 times slower than reading a long CSV (16,000,000 x 11).
I've tested this on:
- CentOS 6, R 3.5.0, {data.table} 1.11.4
- Ubuntu 16.04.1, R 3.4.4, {data.table} 1.11.4 (CRAN) & 1.11.5 (GitHub)
Reproducible example and results:
nThread <- parallel::detectCores() - 1
# csv <- readr::readr_example("mtcars.csv")
# df <- data.table::fread(csv, data.table = FALSE)
## LONG CSV
csv2 <- "tmp-data/mtcars-long.csv"
# dir.create("tmp-data")
# data.table::fwrite(df[rep(seq_len(nrow(df)), 500000), ], csv2,
# quote = FALSE, row.names = FALSE)
system.time(dt2 <- data.table::fread(csv2, nThread = nThread))
# user system elapsed
# 9.218 2.713 0.447
dim(dt2)
# [1] 16000000 11
## WIDE CSV
csv3 <- "tmp-data/mtcars-wide.csv"
# data.table::fwrite(df[rep(seq_len(nrow(df)), 50), rep(seq_len(ncol(df)), 10000)], csv3,
# quote = FALSE, row.names = FALSE)
system.time(dt3 <- data.table::fread(csv3, nThread = nThread, verbose = TRUE))
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 31 threads (omp_get_max_threads()=32, nth=31)
# NAstrings = [<<NA>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as integer
# [02] Opening the file
# Opening file tmp-data/mtcars-wide.csv
# File opened, size = 590MB (618987768 bytes).
# Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<mpg,cyl,disp,hp,drat,wt,qsec,v>>
# [06] Detect separator, quoting rule, and ncolumns
# Detecting sep automatically ...
# sep=',' with 100 lines of 110000 fields using quote rule 0
# Detected 110000 columns on line 1. This line is either column names or first data row. Line starts as: <<mpg,cyl,disp,hp,drat,wt,qsec,v>>
# Quote rule picked = 0
# fill=false and the most number of columns found is 110000
# [07] Detect column types, good nrow estimate and whether first row is column names
# Number of sampling jump points = 1 because (618987767 bytes from row 1 to eof) / (2 * 39177768 jump0size) == 7
# Type codes (jump 000) : 75757775555757577755557575777555575757775555757577755557575777555575757775555757...5757775555 Quote rule 0
# Type codes (jump 001) : 75757775555757577755557575777555575757775555757577755557575777555575757775555757...5757775555 Quote rule 0
# 'header' determined to be true due to column 1 containing a string on row 1 and a lower type (float64) in the rest of the 150 sample rows
# =====
# Sampled 150 rows (handled \n inside quoted fields) at 2 jump points
# Bytes from first data row on line 2 to the end of last row: 617999999
# Line length: mean=385933.33 sd=-nan min=360000 max=400000
# Estimated number of rows: 617999999 / 385933.33 = 1602
# Initial alloc = 1762 rows (1602 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
# [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 75757775555757577755557575777555575757775555757577755557575777555575757775555757...5757775555
# [10] Allocate memory for the datatable
# Allocating 110000 column slots (110000 - 0 dropped) with 1762 rows
# [11] Read the data
# jumps=[0..1), chunk_size=385933326, total_size=617999999
# Read 1600 rows x 110000 columns from 590MB (618987768 bytes) file in 00:05.433 wall clock time
# [12] Finalizing the datatable
# Type counts:
# 60000 : int32 '5'
# 50000 : float64 '7'
# =============================
# 0.000s ( 0%) Memory map 0.576GB file
# 1.419s ( 26%) sep=',' ncol=110000 and header detection
# 0.009s ( 0%) Column type detection using 150 sample rows
# 0.095s ( 2%) Allocation of 1762 rows x 110000 cols (1.052GB) of which 1600 ( 91%) rows used
# 3.909s ( 72%) Reading 1 chunks (0 swept) of 368.055MB (each chunk 1600 rows) using 1 threads
# + 3.114s ( 57%) Parse to row-major thread buffers (grown 0 times)
# + 0.793s ( 15%) Transpose
# + 0.002s ( 0%) Waiting
# 0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
# 5.433s Total
# user system elapsed
# 5.163 0.537 5.701
dim(dt3)
# [1] 1600 110000
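The same asymmetry can be checked at a smaller scale without the 590 MB files. A minimal self-contained sketch (the repetition counts here are illustrative choices, not the ones from the report above, and the gap may be narrower at these sizes):

```r
library(data.table)

# Build long and wide CSVs from the built-in mtcars data,
# with the same total number of cells in each file.
df <- as.data.frame(mtcars)
long_csv <- tempfile(fileext = ".csv")
wide_csv <- tempfile(fileext = ".csv")

# long: 160,000 x 11; wide: 1,600 x 1,100 (both 1.76M cells)
fwrite(df[rep(seq_len(nrow(df)), 5000), ], long_csv)
fwrite(df[rep(seq_len(nrow(df)), 50), rep(seq_len(ncol(df)), 100)], wide_csv)

print(system.time(fread(long_csv)))
print(system.time(fread(wide_csv)))
```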
This is basically to be expected, since data.table parallelizes over rows and not columns. Having the routine dynamically optimize over the dimension of parallelization could be considered but I think it's out of scope. It's not very common to have such wide tables where it matters, and supporting this will require a lot of maintenance overhead. If you want to write a PR demonstrating what code would be needed to support this, please go ahead, but the current maintainer core will not pursue this.
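For anyone hitting this in practice, a pragmatic workaround is to avoid text parsing for very wide tables altogether. A hedged sketch using base R serialization (this is a suggestion, not something from the report; format-specific packages such as {fst} are another option):

```r
# A wide table stored in a binary format skips per-field text parsing
# on read (and the row-major parse + transpose step the verbose log
# above shows for CSV). Toy 1,600 x 1,100 data built from mtcars:
wide <- as.data.frame(mtcars)[rep(1:32, 50), rep(1:11, 100)]

rds <- tempfile(fileext = ".rds")
saveRDS(wide, rds)
system.time(wide2 <- readRDS(rds))
```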