I am loading a 100MB CSV file using readcsv, and it takes 70 seconds. The file contains comma-separated values that are a mix of integers and strings. It is 1,600,000 rows and 9 columns. Some rows have a 10th column as well, which just becomes part of the last column.
The profiler reveals that a majority of the time is spent in split, which is not unexpected, but it would be nice to load such files quickly.
I noticed that countlines uses a local 8K buffer to read data into and works off that. That seems to add to its speed. Should that be adopted in readuntil also?
Another approach could also probably be to have a CSV data type that just keeps the data as a blob and cells as offsets into it. Then getindex can return strings (or any other type) on the fly. This should work well for cases where the CSV is readonly or where the CSV needs to be filtered through a user supplied function.
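The blob-plus-offsets idea could be sketched roughly as below. This is a hypothetical illustration, not code from any actual proposal: the whole file is kept as one string, each cell is recorded as a byte range, and getindex materializes a SubString lazily. It assumes a trailing newline and no quoting.

```julia
struct LazyCSV
    blob::String                   # the whole file as one string
    cells::Matrix{UnitRange{Int}}  # byte range of each cell
end

function LazyCSV(data::String)
    rows = Vector{UnitRange{Int}}[]
    row = UnitRange{Int}[]
    start = 1
    for i in 1:ncodeunits(data)
        c = codeunit(data, i)
        (c == UInt8(',') || c == UInt8('\n')) || continue
        push!(row, start:i-1)      # record the cell's byte range
        start = i + 1
        if c == UInt8('\n')
            push!(rows, row)
            row = UnitRange{Int}[]
        end
    end
    ncols = maximum(length, rows)
    cells = [c <= length(rows[r]) ? rows[r][c] : (1:0)
             for r in 1:length(rows), c in 1:ncols]
    return LazyCSV(data, cells)
end

# Cells are materialized as SubStrings only on access; unread cells cost nothing.
Base.getindex(t::LazyCSV, r::Integer, c::Integer) = SubString(t.blob, t.cells[r, c])
```

Ragged rows (like the 10th-column case above) would need real handling; here short rows just yield empty cells.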
@loladiro Should readcsv be rewritten to use a buffer like countlines? Roughly when do you plan to have buffered I/O implemented?
Also, I believe that buffered I/O will only help address about 20% of the performance issue here.
File reading still uses the old I/O system, which is already buffered.
Ok, this is what readdlm is doing. It turns out that if instead you do reads in 8K blocks, the way countlines does, the time is much lower.
julia> @time while !eof(f)
           r = readuntil(f, "\n")
       end
elapsed time: 24.508526186 seconds
This may not have to do with buffered I/O, but just the sheer number of read calls and the number of strings created. However, reading a buffer, and then working line by line within that buffer, does seem to be much faster.
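The countlines-style approach could look roughly like this: read fixed 8K blocks into a reusable buffer and scan for newlines there, instead of paying for one readuntil call (and one String allocation) per line. This is only a sketch of the technique, not the actual countlines or readdlm code; `process_lines` and `on_line` are made-up names.

```julia
# Scan an IO in 8K blocks, invoking on_line(line::String) for each line found.
function process_lines(f::IO, on_line::Function)
    buf = Vector{UInt8}(undef, 8192)
    carry = UInt8[]                 # partial line left over from the previous block
    while !eof(f)
        n = readbytes!(f, buf)
        first = 1
        for i in 1:n
            if buf[i] == UInt8('\n')
                line = isempty(carry) ? String(buf[first:i-1]) :
                                        String(vcat(carry, buf[first:i-1]))
                empty!(carry)
                on_line(line)
                first = i + 1
            end
        end
        append!(carry, view(buf, first:n))  # line continues into the next block
    end
    isempty(carry) || on_line(String(carry))
    return nothing
end
```

The per-line String allocation is still there; the win is replacing many small reads with a few big ones.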
We need to skip the encoding check that readline does. Somehow we need to specify the encoding of the file.
Neither byte_string_classify nor u8_isvalid appear that prominently in sprofile. Will create a readline_ascii and try this out though.
A readcsv_ascii method that avoids the call to byte_string_classify in readuntil seems only slightly faster, not enough to make a difference.
julia> @time csv = Base.readcsv("vList1.csv");
elapsed time: 194.58129686 seconds
julia> @time csv = Base.readcsv_ascii("vList1.csv");
elapsed time: 186.778873825 seconds
With @tanmaykm's readcsv implementation, I am seeing 5x performance gains on large files. The trick seems to be avoiding the use of split. I tried read.csv in R as well. Here are some timings for a CSV file that is 347M and has 1.8M rows and 12 columns.
                          Time   Memory (in top)
R (read.csv)              400s   5.5G
Julia (current readcsv)   197s   2.2G
Julia (new readcsv)        37s   2.2G
Comparing our readcsv with R's is a little problematic: R's function does much more work. Our readcsv is basically broken by design since it will only parse a trivial subset of CSV files that only contain numbers and nothing else.
The DataFrames read_table function is a better comparison. It's unbearably slow on a file of that size.
R's read.csv has a ton of features, which I expect we will want too. This was just to check that we are not grossly slow or inefficient in other ways. Currently, all the time in our readcsv seems to be going into allocation of substring objects, even in the new implementation - but the performance improvement is nice to have.
I did try read_table at first, but that barfed on my file. How do we get these improvements into read_table? Eventually, I do want a dataframe. For now, I am using readcsv and passing the array to DataFrame().
The performance improvements are huge. Anything @tanmaykm can do to make IO faster will be a big gain in the future.
@JeffBezanson and I were chatting a while back about readcsv in Base: it's tempting to just remove it and have people call readdlm since that's all we're doing. I think the deciding argument for me was this point: if you're trying to read a table from Excel via CSV, you probably want a DataFrame as the result, not an array of strings or anything else that Base could provide.
I will try to get back to read_table next week since it's clearly the weakest part of the DataFrames package. I got sidetracked by the number of bugs I kept finding in Distributions, which has kept me busy recently. I believe that we need more aggressive buffering in read_table to make it functional. I'll probably need help figuring out how to get read_table to be fast enough.
Just as a sidenote: the biggest problem with CSV files in the wild is that they're not purely line-delimited: fields are allowed to contain raw newlines as long as the field is quoted.
With mixed types, you almost always want a DataFrame. However, there are cases where a CSV file is well-formed and is all numbers or strings, in which case an array is good enough.
I think readcsv is a nice shorthand to have for readdlm. All these enhancements certainly need to go into readdlm.
It would be nice to refactor read_table and readdlm so that they share code to the extent possible. We will soon do a pull request, where the interface to readdlm is worth thinking a bit more about.
@johnmyleswhite Even the current readdlm in base does read in numbers and strings, and figures out the types and such. Still not even close to being as feature-ful as R, but it is certainly doing better than just reading CSVs of numbers.
That said, now that we are at it, it is worth making readdlm more useful for "files in the wild".
Put the initial code up as a package at https://github.com/tanmaykm/CSV.jl for comments.
It exports a readcsv2 method that works similarly to readdlm.
Interesting thread. I've actually been doing related work for the Sqlite package over the last few days, writing the readdlmsql function. For those not familiar with the sqldf package in R, its cousin function is read.csv.sql, which allows you to specify a delimited file (.csv, .tsv, whatever you want) as well as an SQL statement to return a filtered DataFrame. I'm pleased with the performance so far, in a simple test:
@time Sqlite.readdlmsql(Pkg.dir() * "/Sqlite/test/sales.csv";sql="select * from sales",name="sales")
elapsed time: 13.848334564 seconds
@time readcsv2(Pkg.dir() * "/Sqlite/test/sales.csv")
elapsed time: 10.939370505 seconds
This is using @tanmaykm's new readcsv2 function, without specifying column types (I tried, but I got an error saying ERROR: in fillall: j not defined). readdlmsql can take column types input also, or run a little inference (infer=true), or without specifying, it reads everything as Strings (like readcsv2's default).
The same file in R:
system.time(t <- read.csv("C:/Users/karbarcca/Google Drive/Dropbox/Dropbox/Sears/Teradata/Sears-Julia/sales.csv"))
user system elapsed
7.75 0.13 8.02
Definitely some work to do here overall to get Julia up to speed. I just pushed the new update to the Sqlite package.
readcsv2 does not yet recognize header rows. The file stripped of the first row can be read as:
julia> @time csv = readcsv2("sales_noheader.csv", coltypes=(Int,Int,String,Int,Int,Int,Int,Int,String,Int,Int,Int))
elapsed time: 28.139021921 seconds
julia> @time Sqlite.readdlmsql("sales.csv";sql="select * from sales",name="sales")
elapsed time: 21.067049005 seconds
Since SQLite is compiled C code, it would be expected to perform better, particularly when most of the columns are integers and there is little overhead of receiving them in Julia. It will be interesting to see how much can be gained in Julia by having numbers parsed directly out of byte buffers.
I guess that for columns that one knows are ints, either inferred or supplied by the user, it is worth avoiding the overhead of creating substrings. The experiment of extending parseint to work on a region of a bytestring should be easy enough to try out. It is possible that the time spent in parseint dwarfs the time spent creating substrings.
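A minimal sketch of that experiment might look like the following. `parseint_region` is a hypothetical helper, not an existing Base function: it parses a non-negative decimal Int directly out of a byte-buffer region, with no substring and no allocation. A real version would also handle signs, overflow, and whitespace.

```julia
# Parse the digits in buf[lo:hi] as a non-negative Int, allocation-free.
function parseint_region(buf::AbstractVector{UInt8}, lo::Int, hi::Int)
    n = 0
    for i in lo:hi
        b = buf[i]
        UInt8('0') <= b <= UInt8('9') || error("not a digit at byte $i")
        n = 10n + (b - UInt8('0'))  # accumulate digit by digit
    end
    return n
end
```

Given the cell offsets a reader already computes while scanning for delimiters, integer columns could go straight through this path and skip string creation entirely.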
There are two separate file formats: tsv/dlm files, where each row is one line, and CSV, which is more complicated. We could support only the first in base, and have readcsv in DataFrames.
I think we are using CSV loosely as a term for the simple files, where each row is one line, for the purposes of discussion here. However, I agree that readcsv with all its bells and whistles probably belongs in DataFrames.
I think that the current readdlm in base is the right feature set, but we should borrow the improvements posted by @tanmaykm.
In the past few days there was a 2x regression in the performance of split, which I just fixed. It's probably still too slow, but worth checking again.
One thing I was going to mention with regards to "csv files in the wild" with quoted values containing \n or the delimiter even. The workaround I currently have for the Sqlite package is a slightly modified search function that gets called within split (search_quoted and split_quoted respectively). The modification just says that while searching, if a \" character is encountered, everything is ignored until the closing \" is found. This is similar to R's read.csv function which has a keyword argument quote = "\"'" where the user can specify the quote character to use.
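The quote-aware scan described above could be sketched like this. To be clear, this is an illustrative toy, not the actual search_quoted/split_quoted from the Sqlite package: delimiters seen between a pair of quote characters are ignored, and the quotes themselves are stripped (real CSV would also need doubled-quote escaping, per RFC 4180).

```julia
# Split a line on delim, ignoring delimiters inside quotechar pairs.
function split_quoted(line::AbstractString, delim::Char=','; quotechar::Char='"')
    fields = String[]
    field = IOBuffer()
    inquote = false
    for c in line
        if c == quotechar
            inquote = !inquote             # toggle quoted state
        elseif c == delim && !inquote
            push!(fields, String(take!(field)))
        else
            write(field, c)
        end
    end
    push!(fields, String(take!(field)))
    return fields
end
```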
Would this be useful in base readdlm? It could be as simple as adding another Bool argument to search to say whether to ignore anything inside \" characters. Or adding an entire quote argument that specifies what character to use as the quote character. I think it definitely belongs in DataFrames as @johnmyleswhite mentioned, but also think it could be a useful feature in Base. Though Base's functionality is more focused on numeric data and this arises more in string data, so maybe it's better to just keep it in DataFrames and Sqlite.
I think we should push all these real world requirements into DataFrames.
With the improved split performance, readcsv now loads my dataset in 146s instead of 220s. Still a far cry from readcsv2, which does so in about 40s.
What if split were made to return an Array of SubString instead of String? That is usually what one wants in any case, and it is consistent with the idea of array views.
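A sketch of what such a split variant could look like, returning SubString views into the original string instead of freshly allocated copies (this is an illustration; modern Julia's split does in fact return SubStrings):

```julia
# Split s on a single-character delimiter, returning views into s.
function split_views(s::String, delim::Char)
    parts = SubString{String}[]
    start = 1
    for i in eachindex(s)
        if s[i] == delim
            push!(parts, SubString(s, start, prevind(s, i)))
            start = nextind(s, i)
        end
    end
    push!(parts, SubString(s, start, lastindex(s)))
    return parts
end
```

Each SubString shares the parent string's memory, so splitting a long line allocates only the small view objects, not a copy of every field.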
faster readcsv. fixes #3350