I am loading a 100MB CSV file using readcsv, and it takes 70 seconds. The file contains comma-separated values that are a mix of integers and strings. It is 1,600,000 rows and 9 columns. Some rows have a 10th column as well, which just becomes part of the last column.
The profiler reveals that a majority of the time is spent in split, which is not unexpected, but it would be nice to load such files quickly.
I noticed that countlines uses a local 8K buffer to read data into and works off that. That seems to add to its speed. Should that be adopted in readuntil also?
Another approach could also probably be to have a CSV data type that just keeps the data as a blob and cells as offsets into it. Then getindex can return strings (or any other type) on the fly. This should work well for cases where the CSV is readonly or where the CSV needs to be filtered through a user supplied function.
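The blob-plus-offsets idea could be sketched roughly as below. This is a hypothetical illustration, not code from any actual proposal: the whole file is kept as one string, each cell is recorded as a byte range, and getindex materializes a SubString lazily. It assumes a trailing newline and no quoting.

```julia
struct LazyCSV
    blob::String                   # the whole file as one string
    cells::Matrix{UnitRange{Int}}  # byte range of each cell
end

function LazyCSV(data::String)
    rows = Vector{UnitRange{Int}}[]
    row = UnitRange{Int}[]
    start = 1
    for i in 1:ncodeunits(data)
        c = codeunit(data, i)
        (c == UInt8(',') || c == UInt8('\n')) || continue
        push!(row, start:i-1)      # record the cell's byte range
        start = i + 1
        if c == UInt8('\n')
            push!(rows, row)
            row = UnitRange{Int}[]
        end
    end
    ncols = maximum(length, rows)
    cells = [c <= length(rows[r]) ? rows[r][c] : (1:0)
             for r in 1:length(rows), c in 1:ncols]
    return LazyCSV(data, cells)
end

# Cells are materialized as SubStrings only on access; unread cells cost nothing.
Base.getindex(t::LazyCSV, r::Integer, c::Integer) = SubString(t.blob, t.cells[r, c])
```

Ragged rows (like the 10th-column case above) would need real handling; here short rows just yield empty cells.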
@loladiro Should readcsv be rewritten to use a buffer like countlines? Roughly when do you plan to have buffered I/O implemented?
Also, I believe that buffered I/O will only help address about 20% of the performance issue here.
File reading still uses the old I/O system, which is already buffered.
Ok, this is what readdlm is doing. It turns out that if instead you do reads in 8K blocks, the way countlines does, the time is much lower.
julia> @time while !eof(f)
           r = readuntil(f, "\n")
       end
elapsed time: 24.508526186 seconds
This may not have to do with buffered I/O, but just the sheer number of read calls and the number of strings created. However, reading a buffer, and then working line by line within that buffer, does seem to be much faster.
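The countlines-style approach could look roughly like this: read fixed 8K blocks into a reusable buffer and scan for newlines there, instead of paying for one readuntil call (and one String allocation) per line. This is only a sketch of the technique, not the actual countlines or readdlm code; `process_lines` and `on_line` are made-up names.

```julia
# Scan an IO in 8K blocks, invoking on_line(line::String) for each line found.
function process_lines(f::IO, on_line::Function)
    buf = Vector{UInt8}(undef, 8192)
    carry = UInt8[]                 # partial line left over from the previous block
    while !eof(f)
        n = readbytes!(f, buf)
        first = 1
        for i in 1:n
            if buf[i] == UInt8('\n')
                line = isempty(carry) ? String(buf[first:i-1]) :
                                        String(vcat(carry, buf[first:i-1]))
                empty!(carry)
                on_line(line)
                first = i + 1
            end
        end
        append!(carry, view(buf, first:n))  # line continues into the next block
    end
    isempty(carry) || on_line(String(carry))
    return nothing
end
```

The per-line String allocation is still there; the win is replacing many small reads with a few big ones.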
We need to skip the encoding check that readline does. Somehow we need to specify the encoding of the file.
Neither byte_string_classify nor u8_isvalid appear that prominently in sprofile. Will create a readline_ascii and try this out though.
A readcsv_ascii method that avoids the call to byte_string_classify in readuntil seems only slightly faster, not enough to make a difference.
julia> @time csv = Base.readcsv("vList1.csv");
elapsed time: 194.58129686 seconds
julia> @time csv = Base.readcsv_ascii("vList1.csv");
elapsed time: 186.778873825 seconds
With @tanmaykm's readcsv implementation, I am seeing 5x performance gains on large files. The trick seems to be avoiding the use of split. I tried read.csv in R as well. Here are some timings for a CSV file that is 347M and has 1.8M rows and 12 columns.
                          Time   Memory (in top)
R (read.csv)              400s   5.5G
Julia (current readcsv)   197s   2.2G
Julia (new readcsv)        37s   2.2G
Comparing our readcsv with R's is a little problematic: R's function does much more work. Our readcsv is basically broken by design since it will only parse a trivial subset of CSV files that only contain numbers and nothing else.
The DataFrames read_table function is a better comparison. It's unbearably slow on a file of that size.
R's read.csv has a ton of features, which I expect we will want too. This was just to check that we are not grossly slow or inefficient in other ways. Currently, all the time in our readcsv seems to be going into allocation of substring objects, even in the new implementation - but the performance improvement is nice to have.
I did try read_table at first, but that barfed on my file. How do we get these improvements into read_table? Eventually, I do want a dataframe. For now, I am using readcsv and passing the array to DataFrame().
The performance improvements are huge. Anything @tanmaykm can do to make IO faster will be a big gain in the future.
@JeffBezanson and I were chatting a while back about readcsv in Base: it's tempting to just remove it and have people call readdlm since that's all we're doing. I think the deciding argument for me was this point: if you're trying to read a table from Excel via CSV, you probably want a DataFrame as the result, not an array of strings or anything else that Base could provide.
I will try to get back to read_table next week since it's clearly the weakest part of the DataFrames package. I got sidetracked by the number of bugs I kept finding in Distributions, which has kept me busy recently. I believe that we need more aggressive buffering in read_table to make it functional. I'll probably need help figuring out how to get read_table to be fast enough.
Just as a sidenote: the biggest problem with CSV files in the wild is that they're not purely line-delimited: fields are allowed to contain raw newlines as long as the field is quoted.
With mixed types, you almost always want a DataFrame. However, there are cases where a CSV file is well-formed and is all numbers or strings, in which case an array is good enough.
I think readcsv is a nice shorthand to have for readdlm. All these enhancements certainly need to go into readdlm.
It would be nice to refactor read_table and readdlm so that they share code to the extent possible. We will soon do a pull request, where the interface to readdlm is worth thinking a bit more about.
@johnmyleswhite Even the current readdlm in base does read in numbers and strings, and figures out the types and such. Still not even close to being as feature-ful as R, but it is certainly doing better than just reading CSVs of numbers.
That said, now that we are at it, it is worth making readdlm more useful for "files in the wild".
Put the initial code up as a package at https://github.com/tanmaykm/CSV.jl for comments.
It exports a readcsv2 method that works similarly to readdlm.
Interesting thread. I've actually been doing related work for the Sqlite package over the last few days, writing the readdlmsql function. For those not familiar with the sqldf package in R, its cousin function is read.csv.sql, which allows you to specify a delimited file (.csv, .tsv, whatever you want) as well as an SQL statement to return a filtered DataFrame. I'm pleased with the performance so far, in a simple test:
@time Sqlite.readdlmsql(Pkg.dir() * "/Sqlite/test/sales.csv";sql="select * from sales",name="sales")
elapsed time: 13.848334564 seconds
@time readcsv2(Pkg.dir() * "/Sqlite/test/sales.csv")
elapsed time: 10.939370505 seconds
This is using @tanmaykm's new readcsv2 function, without specifying column types (I tried, but I got an error saying ERROR: in fillall: j not defined). readdlmsql can take column types input also, or run a little inference (infer=true), or without specifying, it reads everything as Strings (like readcsv2's default).
The same file in R:
system.time(t <- read.csv("C:/Users/karbarcca/Google Drive/Dropbox/Dropbox/Sears/Teradata/Sears-Julia/sales.csv"))
user system elapsed
7.75 0.13 8.02
Definitely some work to do here overall to get Julia up to speed. I just pushed the new update to the Sqlite package.
readcsv2 does not yet recognize header rows. The file stripped of the first row can be read as:
julia> @time csv = readcsv2("sales_noheader.csv", coltypes=(Int,Int,String,Int,Int,Int,Int,Int,String,Int,Int,Int))
elapsed time: 28.139021921 seconds
julia> @time Sqlite.readdlmsql("sales.csv";sql="select * from sales",name="sales")
elapsed time: 21.067049005 seconds
Since SQLite is compiled C code, it would be expected to perform better, particularly when most of the columns are integers and there is little overhead of receiving them in Julia. It will be interesting to see how much can be gained in Julia by having numbers parsed directly out of byte buffers.
I guess that for columns that one knows are ints, either inferred or supplied by the user, it is worth avoiding the overhead of creating substrings. The experiment of extending parseint to work on a region of a bytestring should be easy enough to try out. It is possible that the time spent in parseint dwarfs the time spent creating substrings.
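A minimal sketch of that experiment might look like the following. `parseint_region` is a hypothetical helper, not an existing Base function: it parses a non-negative decimal Int directly out of a byte-buffer region, with no substring and no allocation. A real version would also handle signs, overflow, and whitespace.

```julia
# Parse the digits in buf[lo:hi] as a non-negative Int, allocation-free.
function parseint_region(buf::AbstractVector{UInt8}, lo::Int, hi::Int)
    n = 0
    for i in lo:hi
        b = buf[i]
        UInt8('0') <= b <= UInt8('9') || error("not a digit at byte $i")
        n = 10n + (b - UInt8('0'))  # accumulate digit by digit
    end
    return n
end
```

Given the cell offsets a reader already computes while scanning for delimiters, integer columns could go straight through this path and skip string creation entirely.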
There are two separate file formats: tsv/dlm files, where each row is one line, and CSV, which is more complicated. We could support only the first in base, and have readcsv in DataFrames.
I think we are using CSV loosely as a term for the simple files, where each row is one line, for the purposes of discussion here. However, I agree that readcsv with all its bells and whistles probably belongs in DataFrames.
I think that the current readdlm in base is the right feature set, but we should borrow the improvements posted by @tanmaykm.
In the past few days there was a 2x regression in the performance of split, which I just fixed. It's probably still too slow, but worth checking again.
One thing I was going to mention with regards to "csv files in the wild" with quoted values containing \n or the delimiter even. The workaround I currently have for the Sqlite package is a slightly modified search function that gets called within split (search_quoted and split_quoted respectively). The modification just says that while searching, if a \" character is encountered, everything is ignored until the closing \" is found. This is similar to R's read.csv function which has a keyword argument quote = "\"'" where the user can specify the quote character to use.
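The quote-aware scan described above could be sketched like this. To be clear, this is an illustrative toy, not the actual search_quoted/split_quoted from the Sqlite package: delimiters seen between a pair of quote characters are ignored, and the quotes themselves are stripped (real CSV would also need doubled-quote escaping, per RFC 4180).

```julia
# Split a line on delim, ignoring delimiters inside quotechar pairs.
function split_quoted(line::AbstractString, delim::Char=','; quotechar::Char='"')
    fields = String[]
    field = IOBuffer()
    inquote = false
    for c in line
        if c == quotechar
            inquote = !inquote             # toggle quoted state
        elseif c == delim && !inquote
            push!(fields, String(take!(field)))
        else
            write(field, c)
        end
    end
    push!(fields, String(take!(field)))
    return fields
end
```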
Would this be useful in base readdlm? It could be as simple as adding another Bool argument to search to say whether to ignore anything inside \" characters. Or adding an entire quote argument that specifies what character to use as the quote character. I think it definitely belongs in DataFrames as @johnmyleswhite mentioned, but also think it could be a useful feature in Base. Though Base's functionality is more focused on numeric data and this arises more in string data, so maybe it's better to just keep it in DataFrames and Sqlite.
I think we should push all these real world requirements into DataFrames.
With the improved split performance, readcsv now loads my dataset in 146s instead of 220s. Still a far cry from readcsv2, which does so in about 40s.
What if split were made to return an Array of SubString instead of String? That is usually what one wants in any case, and it is consistent with the idea of array views.
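A sketch of what such a split variant could look like, returning SubString views into the original string instead of freshly allocated copies (this is an illustration; modern Julia's split does in fact return SubStrings):

```julia
# Split s on a single-character delimiter, returning views into s.
function split_views(s::String, delim::Char)
    parts = SubString{String}[]
    start = 1
    for i in eachindex(s)
        if s[i] == delim
            push!(parts, SubString(s, start, prevind(s, i)))
            start = nextind(s, i)
        end
    end
    push!(parts, SubString(s, start, lastindex(s)))
    return parts
end
```

Each SubString shares the parent string's memory, so splitting a long line allocates only the small view objects, not a copy of every field.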
faster readcsv. fixes #3350