Reading wide data uses only one chunk (so only one thread) #2985

Closed
privefl opened this issue Jul 25, 2018 · 1 comment

privefl commented Jul 25, 2018

For the same amount of data, reading a wide CSV (1,600 x 110,000) is much slower than reading a long CSV (16,000,000 x 11): between 10 and 15 times slower.

I've tested this on:

  • CentOS 6, R 3.5.0, {data.table} 1.11.4
  • Ubuntu 16.04.1, R 3.4.4, {data.table} 1.11.4 (CRAN) & 1.11.5 (GitHub)

Reproducible example and results:

nThread <- parallel::detectCores() - 1

## Base data: the mtcars.csv example file shipped with {readr}
csv <- readr::readr_example("mtcars.csv")
df <- data.table::fread(csv, data.table = FALSE)

## LONG CSV
csv2 <- "tmp-data/mtcars-long.csv"
dir.create("tmp-data", showWarnings = FALSE)
data.table::fwrite(df[rep(seq_len(nrow(df)), 500000), ], csv2,
                   quote = FALSE, row.names = FALSE)
system.time(dt2 <- data.table::fread(csv2, nThread = nThread))
#  user  system elapsed 
# 9.218   2.713   0.447 
dim(dt2)
# [1] 16000000       11


## WIDE CSV
csv3 <- "tmp-data/mtcars-wide.csv"
data.table::fwrite(df[rep(seq_len(nrow(df)), 50), rep(seq_len(ncol(df)), 10000)], csv3,
                   quote = FALSE, row.names = FALSE)
system.time(dt3 <- data.table::fread(csv3, nThread = nThread, verbose = TRUE))
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
# Using 31 threads (omp_get_max_threads()=32, nth=31)
# NAstrings = [<<NA>>]
# None of the NAstrings look like numbers.
# show progress = 1
# 0/1 column will be read as integer
# [02] Opening the file
# Opening file tmp-data/mtcars-wide.csv
# File opened, size = 590MB (618987768 bytes).
# Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
# \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
# [05] Skipping initial rows if needed
# Positioned on line 1 starting: <<mpg,cyl,disp,hp,drat,wt,qsec,v>>
#   [06] Detect separator, quoting rule, and ncolumns
# Detecting sep automatically ...
# sep=','  with 100 lines of 110000 fields using quote rule 0
# Detected 110000 columns on line 1. This line is either column names or first data row. Line starts as: <<mpg,cyl,disp,hp,drat,wt,qsec,v>>
#   Quote rule picked = 0
# fill=false and the most number of columns found is 110000
# [07] Detect column types, good nrow estimate and whether first row is column names
# Number of sampling jump points = 1 because (618987767 bytes from row 1 to eof) / (2 * 39177768 jump0size) == 7
# Type codes (jump 000)    : 75757775555757577755557575777555575757775555757577755557575777555575757775555757...5757775555  Quote rule 0
# Type codes (jump 001)    : 75757775555757577755557575777555575757775555757577755557575777555575757775555757...5757775555  Quote rule 0
# 'header' determined to be true due to column 1 containing a string on row 1 and a lower type (float64) in the rest of the 150 sample rows
# =====
#   Sampled 150 rows (handled \n inside quoted fields) at 2 jump points
# Bytes from first data row on line 2 to the end of last row: 617999999
# Line length: mean=385933.33 sd=-nan min=360000 max=400000
# Estimated number of rows: 617999999 / 385933.33 = 1602
# Initial alloc = 1762 rows (1602 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
# =====
#   [08] Assign column names
# [09] Apply user overrides on column types
# After 0 type and 0 drop user overrides : 75757775555757577755557575777555575757775555757577755557575777555575757775555757...5757775555
# [10] Allocate memory for the datatable
# Allocating 110000 column slots (110000 - 0 dropped) with 1762 rows
# [11] Read the data
# jumps=[0..1), chunk_size=385933326, total_size=617999999
# Read 1600 rows x 110000 columns from 590MB (618987768 bytes) file in 00:05.433 wall clock time
# [12] Finalizing the datatable
# Type counts:
#   60000 : int32     '5'
# 50000 : float64   '7'
# =============================
#   0.000s (  0%) Memory map 0.576GB file
# 1.419s ( 26%) sep=',' ncol=110000 and header detection
# 0.009s (  0%) Column type detection using 150 sample rows
# 0.095s (  2%) Allocation of 1762 rows x 110000 cols (1.052GB) of which 1600 ( 91%) rows used
# 3.909s ( 72%) Reading 1 chunks (0 swept) of 368.055MB (each chunk 1600 rows) using 1 threads
# +    3.114s ( 57%) Parse to row-major thread buffers (grown 0 times)
# +    0.793s ( 15%) Transpose
# +    0.002s (  0%) Waiting
# 0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
# 5.433s        Total
# user  system elapsed 
# 5.163   0.537   5.701 
dim(dt3)
# [1]   1600 110000
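
The same behaviour can be checked without {readr} or the tmp-data/ files. The sketch below is an editorial addition with arbitrarily scaled-down dimensions: it writes the same values in a long and a wide shape to temporary files and times fread() on both. The absolute timings and the wide-vs-long ratio will differ from the numbers above; the line to watch in the verbose output is the "Reading ... chunks ... using ... threads" one.

library(data.table)
nThread <- max(1L, parallel::detectCores() - 1L)

vals <- runif(4e6)                                    # same values, two shapes
long <- as.data.table(matrix(vals, nrow = 400000, ncol = 10))
wide <- as.data.table(matrix(vals, nrow = 400, ncol = 10000))

f_long <- tempfile(fileext = ".csv"); fwrite(long, f_long)
f_wide <- tempfile(fileext = ".csv"); fwrite(wide, f_wide)

system.time(fread(f_long, nThread = nThread))
system.time(fread(f_wide, nThread = nThread, verbose = TRUE))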
MichaelChirico (Member) commented

This is basically to be expected, since data.table parallelizes over rows, not columns. Having the routine dynamically choose which dimension to parallelize over could be considered, but I think it's out of scope: tables wide enough for it to matter are uncommon, and supporting them would add a lot of maintenance overhead. If you want to write a PR demonstrating what code would be needed, please go ahead, but the current maintainer core will not pursue this.
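
In the meantime, one possible workaround is to parallelize over columns manually from R: read blocks of columns in separate workers with fread(select = ...) and bind the pieces. This is only a sketch; read_wide_parallel is a made-up helper (not part of data.table), the block size and worker count are arbitrary, each worker still scans the whole file (so the gain depends on the file and the machine), and parallel::mclapply() forks, so it assumes a Unix-alike.

library(data.table)

read_wide_parallel <- function(path, block_size = 10000L,
                               workers = max(1L, parallel::detectCores() - 1L)) {
  header <- names(fread(path, nrows = 0L))            # column names only
  blocks <- split(seq_along(header),
                  ceiling(seq_along(header) / block_size))
  parts  <- parallel::mclapply(blocks, function(cols) {
    fread(path, select = cols, nThread = 1L)          # one block of columns per worker
  }, mc.cores = workers)
  do.call(cbind, unname(parts))                       # reassemble into one data.table
}

# dt3b <- read_wide_parallel("tmp-data/mtcars-wide.csv")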
