
CSV.read is still incredibly slow on Windows with 0.4.1 #15

Closed
kafisatz opened this issue Nov 10, 2015 · 6 comments

Comments

@kafisatz
Contributor

@quinnj

I just did a test on a 1000 row dataset.
I needed to set rows_for_type_detect to 1000, which may be slightly "unfair" compared to readcsv, which does no type detection. Still, CSV.read is far too slow:
**it takes 13 seconds instead of 0.04s**

I note that the functions were already compiled in the example below.

I was hoping that this would work better now, as you indicated here:
https://groups.google.com/forum/#!searchin/julia-users/csv/julia-users/IFkPso4JUac/lNLgLoCqAwAJ

Any hints?

julia> f="T:\temp\julia1k.csv"
"T:\temp\julia1k.csv"

julia> @time f1=readcsv(f);
0.043854 seconds (239.86 k allocations: 8.536 MB)

julia> @time df=readtable(f);
0.039639 seconds (221.93 k allocations: 10.359 MB, 15.51% gc time)

julia> @time f2=CSV.read(f,rows_for_type_detect=1000);
13.760476 seconds (1.79 M allocations: 73.616 MB, 0.12% gc time)

julia> @show size(f1),size(f2),size(df)
(size(f1),size(f2),size(df)) = ((1000,77),(999,77),(999,77))
((1000,77),(999,77),(999,77))

julia> versioninfo(true)
Julia Version 0.4.1
Commit cbe1bee* (2015-11-08 10:33 UTC)
Platform Info:
System: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
WORD_SIZE: 64
Microsoft Windows [Version 6.1.7601]
uname: MSYS_NT-6.1 2.3.0(0.290/5/3) 2015-09-29 10:48 x86_64 unknown
Memory: 31.694698333740234 GB (26403.6875 MB free)
Uptime: 1.1864766877332e6 sec
Load Avg: 0.0 0.0 0.0
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz:
speed user nice sys idle irq ticks
#1 3410 MHz 4675942 0 2720907 1179080330 152865 ticks
#2 3410 MHz 609105 0 854667 1185013080 87454 ticks
#3 3410 MHz 6070357 0 9145699 1171260702 124348 ticks
#4 3410 MHz 786603 0 1347911 1184342104 18033 ticks
#5 3410 MHz 6533228 0 11563262 1168380019 145923 ticks
#6 3410 MHz 106033 0 37487 1186332833 1404 ticks
#7 3410 MHz 5059143 0 8723326 1172693743 114894 ticks
#8 3410 MHz 2913786 0 1522086 1182040247 29094 ticks

BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
Environment:
.CLASSPATH = C:\Users\workstation\Documents\mongojdbcdriver
CLASSPATH = C:\Users\workstation\Documents\mongojdbcdriver
GROOVY_HOME = C:\Program Files (x86)\Groovy\Groovy-2.2.2
HOMEDRIVE = C:
HOMEPATH = \Users\workstation
JAVA_HOME = C:\Program Files\Java\jre8
JULIA_HOME = C:\Program Files\Juno\resources\app\julia\bin
PATHEXT = .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC;.groovy;.gy

Package Directory: C:\Users\workstation\.julia
27 required packages:

  • Coverage 0.2.3
  • DataFrames 0.6.10
  • Dates 0.4.4
  • FastAnonymous 0.3.2
  • Gadfly 0.3.18
  • Graphics 0.1.3
  • HDF5 0.5.6
  • Iterators 0.1.9
  • JLD 0.5.6
  • JSON 0.5.0
  • Jewel 1.0.7
  • Libz 0.0.2
  • Lint 0.1.68
  • Loess 0.0.5
  • Mocha 0.1.0
  • NumericExtensions 0.6.2
  • ODBC 0.3.10
  • ProfileView 0.1.1
  • ProgressMeter 0.2.2
  • PyCall 1.2.0
  • RDatasets 0.1.2
  • SQLite 0.3.0
  • SortingAlgorithms 0.0.6
  • StatsBase 0.7.4
  • TypeCheck 0.0.3
  • WinRPM 0.1.13
  • ZipFile 0.2.5
57 additional packages:
  • ArrayViews 0.6.4
  • BinDeps 0.3.19
  • Blosc 0.1.4
  • BufferedStreams 0.0.2
  • CSV 0.0.2
  • Cairo 0.2.31
  • Calculus 0.1.14
  • Codecs 0.1.5
  • ColorTypes 0.2.0
  • Colors 0.6.0
  • Compat 0.7.7
  • Compose 0.3.18
  • Conda 0.1.8
  • Contour 0.0.8
  • DataArrays 0.2.20
  • DataStreams 0.0.2
  • DataStructures 0.3.13
  • Debug 0.1.6
  • Distances 0.2.1
  • Distributions 0.8.7
  • Docile 0.5.19
  • DualNumbers 0.1.5
  • FactCheck 0.4.1
  • FileIO 0.0.3
  • FixedPointNumbers 0.1.1
  • GZip 0.2.18
  • Grid 0.4.0
  • Gtk 0.9.2
  • GtkUtilities 0.0.6
  • Hexagons 0.0.4
  • HttpCommon 0.2.4
  • HttpParser 0.1.1
  • ImmutableArrays 0.0.11
  • JuliaParser 0.6.3
  • KernelDensity 0.1.2
  • LNR 0.0.2
  • Lazy 0.10.1
  • LibExpat 0.1.0
  • Logging 0.2.0
  • MacroTools 0.2.0
  • MbedTLS 0.2.0
  • MySQL 0.0.0- master (unregistered, dirty)
  • NaNMath 0.1.1
  • NullableArrays 0.0.2
  • NumericFuns 0.2.4
  • Optim 0.4.4
  • PDMats 0.3.6
  • Reexport 0.0.3
  • Requests 0.3.2
  • Requires 0.2.1
  • SHA 0.1.2
  • Showoff 0.0.6
  • StatsFuns 0.2.0
  • URIParser 0.1.1
  • WoodburyMatrices 0.1.2
  • Zlib 0.1.12
  • lib 0.0.0- non-repo (unregistered)

julia>

@quinnj
Member

quinnj commented Nov 11, 2015

Should be closed by JuliaLang/METADATA.jl#4014

@quinnj quinnj closed this as completed Nov 11, 2015
@quinnj
Member

quinnj commented Nov 11, 2015

(i.e. just do a Pkg.update() now and you should have the latest version)

@kafisatz
Contributor Author

kafisatz commented Dec 4, 2015

@quinnj

This is not resolved. I tested again today (on two different Win7 machines). Here are the warmed-up timings for a 1000-row file:

julia> f="c:\temp\julia1k.csv"
"c:\temp\julia1k.csv"

julia> @time f1=readcsv(f);
0.052015 seconds (239.86 k allocations: 8.536 MB)

julia> @time df=readtable(f);
0.064770 seconds (224.02 k allocations: 10.468 MB, 18.82% gc time)

julia> @time f2=CSV.read(f,rows_for_type_detect=1000);
11.974838 seconds (1.81 M allocations: 74.173 MB, 0.12% gc time)

julia> size(f1)
(1000,77)

julia> size(df)
(999,77)

(I apologize to the GitHub user @time for pinging them.)

@quinnj quinnj reopened this Apr 30, 2016
@quinnj
Member

quinnj commented Apr 30, 2016

@kafisatz, can you share some more details around the slowness you were seeing? Namely:

  • Julia/package versions (versioninfo(true))

I tried to dig into this again this morning, but I'm seeing comparable parsing times between my Mac and a Windows machine on Julia 0.4.1 and the latest CSV master (Pkg.checkout("CSV")).

@kafisatz
Contributor Author

kafisatz commented May 1, 2016

Hi Quinn. It is very fast now.
What I do not fully understand is how to get the data into a format that is usable for me as a layman (e.g. an array, a vector of vectors, a DataFrame, or something similar).

CSV.csv takes extremely long compared to readtable (DataFrames), see below.
The file I read has 100'000 rows and 77 columns.

julia> @time f1=readcsv(f);
3.356507 seconds (27.93 M allocations: 913.321 MB, 5.40% gc time)

julia> @time f2=CSV.read(f);
0.016803 seconds (18 allocations: 41.813 MB)

julia> @time dt=CSV.csv(f,rows_for_type_detect=10000)
164.831213 seconds (25.93 M allocations: 940.161 MB, 0.43% gc time)

julia> @time df=readtable(f);
3.640882 seconds (26.84 M allocations: 969.908 MB)

@kafisatz kafisatz closed this as completed May 1, 2016
@quinnj
Member

quinnj commented May 1, 2016

hey @kafisatz, a couple of things here:

  • Using such a high number for rows_for_type_detect will always make it pretty slow. If you're having trouble getting the right types, it's much faster/easier to use the types argument, something like types=Dict(1=>Float64), to specify that the first column should be Float64.
  • There's currently a bug where CSV.read is not actually calling into CSV but falling through to the regular Base.read method, which just reads the file in as an array of bytes; I should probably try to make that an error somehow.
  • I'll add some more documentation for this, but it's pretty painless to convert the result of CSV.csv to a DataFrame, for example:
using DataFrames
using CSV
dt = CSV.csv("myfile.csv")
df = DataFrame(dt)  # converts our Data.Table `dt` to a DataFrame without copying
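
A minimal sketch of the first point above, assuming a hypothetical file myfile.csv whose first column should be parsed as Float64 (supplying the type up front skips the expensive rows_for_type_detect scan):

```julia
using CSV

# Hypothetical file name; the Dict maps column index => element type.
# Columns not listed still go through the (shorter) default detection.
dt = CSV.csv("myfile.csv", types=Dict(1=>Float64))
```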
