Reading wide data uses only one chunk (so only one thread) #2985

Closed
privefl opened this issue Jul 25, 2018 · 1 comment

privefl commented Jul 25, 2018

For the same amount of data, reading a wide CSV (1,600 x 110,000) is much slower than reading a long CSV (16,000,000 x 11): between 10 and 15 times slower.

I've tested this on:

  • CentOS 6, R 3.5.0, {data.table} 1.11.4
  • Ubuntu 16.04.1, R 3.4.4, {data.table} 1.11.4 (CRAN) & 1.11.5 (GitHub)

Reproducible example and results:

nThread <- parallel::detectCores() - 1

## Base data: the mtcars.csv example file shipped with {readr}
csv <- readr::readr_example("mtcars.csv")
df <- data.table::fread(csv, data.table = FALSE)

## LONG CSV
csv2 <- "tmp-data/mtcars-long.csv"
dir.create("tmp-data", showWarnings = FALSE)
data.table::fwrite(df[rep(seq_len(nrow(df)), 500000), ], csv2,
                   quote = FALSE, row.names = FALSE)
system.time(dt2 <- data.table::fread(csv2, nThread = nThread))
#  user  system elapsed 
# 9.218   2.713   0.447 
dim(dt2)
# [1] 16000000       11


## WIDE CSV
csv3 <- "tmp-data/mtcars-wide.csv"
data.table::fwrite(df[rep(seq_len(nrow(df)), 50), rep(seq_len(ncol(df)), 10000)], csv3,
                   quote = FALSE, row.names = FALSE)
system.time(dt3 <- data.table::fread(csv3, nThread = nThread, verbose = TRUE))
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 31 threads (omp_get_max_threads()=32, nth=31)
# NAstrings = [<<NA>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as integer
# [02] Opening the file
# Opening file tmp-data/mtcars-wide.csv
# File opened, size = 590MB (618987768 bytes).
# Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<mpg,cyl,disp,hp,drat,wt,qsec,v>>
#   [06] Detect separator, quoting rule, and ncolumns
# Detecting sep automatically ...
# sep=','  with 100 lines of 110000 fields using quote rule 0
# Detected 110000 columns on line 1. This line is either column names or first data row. Line starts as: <<mpg,cyl,disp,hp,drat,wt,qsec,v>>
#   Quote rule picked = 0
# fill=false and the most number of columns found is 110000
# [07] Detect column types, good nrow estimate and whether first row is column names
# Number of sampling jump points = 1 because (618987767 bytes from row 1 to eof) / (2 * 39177768 jump0size) == 7
# Type codes (jump 000)    : 75757775555757577755557575777555575757775555757577755557575777555575757775555757...5757775555  Quote rule 0
# Type codes (jump 001)    : 75757775555757577755557575777555575757775555757577755557575777555575757775555757...5757775555  Quote rule 0
# 'header' determined to be true due to column 1 containing a string on row 1 and a lower type (float64) in the rest of the 150 sample rows
# =====
#   Sampled 150 rows (handled \n inside quoted fields) at 2 jump points
# Bytes from first data row on line 2 to the end of last row: 617999999
# Line length: mean=385933.33 sd=-nan min=360000 max=400000
# Estimated number of rows: 617999999 / 385933.33 = 1602
# Initial alloc = 1762 rows (1602 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
#   [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 75757775555757577755557575777555575757775555757577755557575777555575757775555757...5757775555
# [10] Allocate memory for the datatable
# Allocating 110000 column slots (110000 - 0 dropped) with 1762 rows
# [11] Read the data
# jumps=[0..1), chunk_size=385933326, total_size=617999999
# Read 1600 rows x 110000 columns from 590MB (618987768 bytes) file in 00:05.433 wall clock time
# [12] Finalizing the datatable
# Type counts:
#   60000 : int32     '5'
# 50000 : float64   '7'
# =============================
#   0.000s (  0%) Memory map 0.576GB file
# 1.419s ( 26%) sep=',' ncol=110000 and header detection
# 0.009s (  0%) Column type detection using 150 sample rows
# 0.095s (  2%) Allocation of 1762 rows x 110000 cols (1.052GB) of which 1600 ( 91%) rows used
# 3.909s ( 72%) Reading 1 chunks (0 swept) of 368.055MB (each chunk 1600 rows) using 1 threads
# +    3.114s ( 57%) Parse to row-major thread buffers (grown 0 times)
# +    0.793s ( 15%) Transpose
# +    0.002s (  0%) Waiting
# 0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
# 5.433s        Total
# user  system elapsed 
# 5.163   0.537   5.701 
dim(dt3)
# [1]   1600 110000
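
The same behaviour can be checked without {readr} or the tmp-data/ files. The sketch below is an editorial addition with arbitrarily scaled-down dimensions: it writes the same values in a long and a wide shape to temporary files and times fread() on both. The absolute timings and the wide-vs-long ratio will differ from the numbers above; the line to watch in the verbose output is the "Reading ... chunks ... using ... threads" one.

library(data.table)
nThread <- max(1L, parallel::detectCores() - 1L)

vals <- runif(4e6)                                    # same values, two shapes
long <- as.data.table(matrix(vals, nrow = 400000, ncol = 10))
wide <- as.data.table(matrix(vals, nrow = 400, ncol = 10000))

f_long <- tempfile(fileext = ".csv"); fwrite(long, f_long)
f_wide <- tempfile(fileext = ".csv"); fwrite(wide, f_wide)

system.time(fread(f_long, nThread = nThread))
system.time(fread(f_wide, nThread = nThread, verbose = TRUE))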
MichaelChirico (Member) commented

This is basically to be expected, since data.table parallelizes over rows, not columns. Having the routine dynamically choose which dimension to parallelize over could be considered, but I think it's out of scope: tables wide enough for it to matter are uncommon, and supporting them would add a lot of maintenance overhead. If you want to write a PR demonstrating what code would be needed, please go ahead, but the current maintainer core will not pursue this.
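
In the meantime, one possible workaround is to parallelize over columns manually from R: read blocks of columns in separate workers with fread(select = ...) and bind the pieces. This is only a sketch; read_wide_parallel is a made-up helper (not part of data.table), the block size and worker count are arbitrary, each worker still scans the whole file (so the gain depends on the file and the machine), and parallel::mclapply() forks, so it assumes a Unix-alike.

library(data.table)

read_wide_parallel <- function(path, block_size = 10000L,
                               workers = max(1L, parallel::detectCores() - 1L)) {
  header <- names(fread(path, nrows = 0L))            # column names only
  blocks <- split(seq_along(header),
                  ceiling(seq_along(header) / block_size))
  parts  <- parallel::mclapply(blocks, function(cols) {
    fread(path, select = cols, nThread = 1L)          # one block of columns per worker
  }, mc.cores = workers)
  do.call(cbind, unname(parts))                       # reassemble into one data.table
}

# dt3b <- read_wide_parallel("tmp-data/mtcars-wide.csv")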
