
CSV.read is still incredibly slow on Windows with 0.4.1 #15

Closed
kafisatz opened this issue Nov 10, 2015 · 6 comments

Comments

@kafisatz
Contributor

@quinnj

I just did a test on a 1000 row dataset.
I needed to set rows_for_type_detect to 1000, which may be slightly "unfair" compared to readcsv, which does no type detection. Still, CSV.read is far too slow:
**it takes 13 seconds instead of 0.04s**

I note that the functions were already compiled in the example below.

I was hoping that this would work better now, as you indicated here:
https://groups.google.com/forum/#!searchin/julia-users/csv/julia-users/IFkPso4JUac/lNLgLoCqAwAJ

Any hints?

julia> f="T:\temp\julia1k.csv"
"T:\temp\julia1k.csv"

julia> @time f1=readcsv(f);
0.043854 seconds (239.86 k allocations: 8.536 MB)

julia> @time df=readtable(f);
0.039639 seconds (221.93 k allocations: 10.359 MB, 15.51% gc time)

julia> @time f2=CSV.read(f,rows_for_type_detect=1000);
13.760476 seconds (1.79 M allocations: 73.616 MB, 0.12% gc time)

julia> @show size(f1),size(f2),size(df)
(size(f1),size(f2),size(df)) = ((1000,77),(999,77),(999,77))
((1000,77),(999,77),(999,77))

julia> versioninfo(true)
Julia Version 0.4.1
Commit cbe1bee* (2015-11-08 10:33 UTC)
Platform Info:
System: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
WORD_SIZE: 64
Microsoft Windows [Version 6.1.7601]
uname: MSYS_NT-6.1 2.3.0(0.290/5/3) 2015-09-29 10:48 x86_64 unknown
Memory: 31.694698333740234 GB (26403.6875 MB free)
Uptime: 1.1864766877332e6 sec
Load Avg: 0.0 0.0 0.0
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz:
speed user nice sys idle irq ticks
#1 3410 MHz 4675942 0 2720907 1179080330 152865 ticks
#2 3410 MHz 609105 0 854667 1185013080 87454 ticks
#3 3410 MHz 6070357 0 9145699 1171260702 124348 ticks
#4 3410 MHz 786603 0 1347911 1184342104 18033 ticks
#5 3410 MHz 6533228 0 11563262 1168380019 145923 ticks
#6 3410 MHz 106033 0 37487 1186332833 1404 ticks
#7 3410 MHz 5059143 0 8723326 1172693743 114894 ticks
#8 3410 MHz 2913786 0 1522086 1182040247 29094 ticks

BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
Environment:
.CLASSPATH = C:\Users\workstation\Documents\mongojdbcdriver
CLASSPATH = C:\Users\workstation\Documents\mongojdbcdriver
GROOVY_HOME = C:\Program Files (x86)\Groovy\Groovy-2.2.2
HOMEDRIVE = C:
HOMEPATH = \Users\workstation
JAVA_HOME = C:\Program Files\Java\jre8
JULIA_HOME = C:\Program Files\Juno\resources\app\julia\bin
PATHEXT = .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC;.groovy;.gy

Package Directory: C:\Users\workstation\.julia
27 required packages:

  • Coverage 0.2.3
  • DataFrames 0.6.10
  • Dates 0.4.4
  • FastAnonymous 0.3.2
  • Gadfly 0.3.18
  • Graphics 0.1.3
  • HDF5 0.5.6
  • Iterators 0.1.9
  • JLD 0.5.6
  • JSON 0.5.0
  • Jewel 1.0.7
  • Libz 0.0.2
  • Lint 0.1.68
  • Loess 0.0.5
  • Mocha 0.1.0
  • NumericExtensions 0.6.2
  • ODBC 0.3.10
  • ProfileView 0.1.1
  • ProgressMeter 0.2.2
  • PyCall 1.2.0
  • RDatasets 0.1.2
  • SQLite 0.3.0
  • SortingAlgorithms 0.0.6
  • StatsBase 0.7.4
  • TypeCheck 0.0.3
  • WinRPM 0.1.13
  • ZipFile 0.2.5
57 additional packages:
  • ArrayViews 0.6.4
  • BinDeps 0.3.19
  • Blosc 0.1.4
  • BufferedStreams 0.0.2
  • CSV 0.0.2
  • Cairo 0.2.31
  • Calculus 0.1.14
  • Codecs 0.1.5
  • ColorTypes 0.2.0
  • Colors 0.6.0
  • Compat 0.7.7
  • Compose 0.3.18
  • Conda 0.1.8
  • Contour 0.0.8
  • DataArrays 0.2.20
  • DataStreams 0.0.2
  • DataStructures 0.3.13
  • Debug 0.1.6
  • Distances 0.2.1
  • Distributions 0.8.7
  • Docile 0.5.19
  • DualNumbers 0.1.5
  • FactCheck 0.4.1
  • FileIO 0.0.3
  • FixedPointNumbers 0.1.1
  • GZip 0.2.18
  • Grid 0.4.0
  • Gtk 0.9.2
  • GtkUtilities 0.0.6
  • Hexagons 0.0.4
  • HttpCommon 0.2.4
  • HttpParser 0.1.1
  • ImmutableArrays 0.0.11
  • JuliaParser 0.6.3
  • KernelDensity 0.1.2
  • LNR 0.0.2
  • Lazy 0.10.1
  • LibExpat 0.1.0
  • Logging 0.2.0
  • MacroTools 0.2.0
  • MbedTLS 0.2.0
  • MySQL 0.0.0- master (unregistered, dirty)
  • NaNMath 0.1.1
  • NullableArrays 0.0.2
  • NumericFuns 0.2.4
  • Optim 0.4.4
  • PDMats 0.3.6
  • Reexport 0.0.3
  • Requests 0.3.2
  • Requires 0.2.1
  • SHA 0.1.2
  • Showoff 0.0.6
  • StatsFuns 0.2.0
  • URIParser 0.1.1
  • WoodburyMatrices 0.1.2
  • Zlib 0.1.12
  • lib 0.0.0- non-repo (unregistered)

julia>

@quinnj
Member

quinnj commented Nov 11, 2015

Should be closed by JuliaLang/METADATA.jl#4014

@quinnj quinnj closed this as completed Nov 11, 2015
@quinnj
Member

quinnj commented Nov 11, 2015

(i.e. just do a Pkg.update() now and you should have the latest version)

@kafisatz
Contributor Author

kafisatz commented Dec 4, 2015

@quinnj

This is not resolved. I tested again today (on two different Win7 machines). Here are the warmed-up timings for a 1000-row file:

julia> f="c:\temp\julia1k.csv"
"c:\temp\julia1k.csv"

julia> @time f1=readcsv(f);
0.052015 seconds (239.86 k allocations: 8.536 MB)

julia> @time df=readtable(f);
0.064770 seconds (224.02 k allocations: 10.468 MB, 18.82% gc time)

julia> @time f2=CSV.read(f,rows_for_type_detect=1000);
11.974838 seconds (1.81 M allocations: 74.173 MB, 0.12% gc time)

julia> size(f1)
(1000,77)

julia> size(df)
(999,77)

(I apologize to the GitHub user @time for pinging them.)

@quinnj quinnj reopened this Apr 30, 2016
@quinnj
Member

quinnj commented Apr 30, 2016

@kafisatz, can you share some more details around the slowness you were seeing? Namely:

  • Julia/package versions (versioninfo(true))

I tried to dig into this again this morning, but I'm seeing comparable parsing times between my Mac and a Windows machine on Julia 0.4.1 and the latest CSV master (Pkg.checkout("CSV")).

@kafisatz
Contributor Author

kafisatz commented May 1, 2016

Hi Quinn. It is very fast now.
What I do not fully understand is how to get the data into a format that is usable for me as a layman (e.g. an array, a vector of vectors, a DataFrame, or something similar).

CSV.csv takes extremely long compared to readtable (DataFrames), see below.
The file I read has 100'000 rows and 77 columns.

julia> @time f1=readcsv(f);
3.356507 seconds (27.93 M allocations: 913.321 MB, 5.40% gc time)

julia> @time f2=CSV.read(f);
0.016803 seconds (18 allocations: 41.813 MB)

julia> @time dt=CSV.csv(f,rows_for_type_detect=10000)
164.831213 seconds (25.93 M allocations: 940.161 MB, 0.43% gc time)

julia> @time df=readtable(f);
3.640882 seconds (26.84 M allocations: 969.908 MB)

@kafisatz kafisatz closed this as completed May 1, 2016
@quinnj
Member

quinnj commented May 1, 2016

hey @kafisatz, a couple of things here:

  • Using such a high number for rows_for_type_detect will always make it pretty slow. If you're having trouble getting the right types, it's much faster/easier to use the types argument, something like types=Dict(1=>Float64), to specify that the first column should be Float64.
  • There's currently a bug where CSV.read is not actually calling into CSV but falling through to the regular Base.read method, which just reads the file in as an array of bytes; I should probably try to make that an error somehow.
  • I'll add some more documentation for this, but it's pretty painless to convert the result of CSV.csv to a DataFrame, for example:
using DataFrames
using CSV
dt = CSV.csv("myfile.csv")
df = DataFrame(dt)  # converts our Data.Table `dt` to a DataFrame without copying
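
A minimal sketch of the first point above, assuming a hypothetical file myfile.csv whose first column should be parsed as Float64 (supplying the type up front skips the expensive rows_for_type_detect scan):

```julia
using CSV

# Hypothetical file name; the Dict maps column index => element type.
# Columns not listed still go through the (shorter) default detection.
dt = CSV.csv("myfile.csv", types=Dict(1=>Float64))
```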
