
Requested CSV Parsing Features [comment here for new requests] #3

Closed
15 of 20 tasks
quinnj opened this issue Jul 14, 2015 · 31 comments

Comments

@quinnj
Member

quinnj commented Jul 14, 2015

This issue is for listing out and refining the desired functionality for working with CSV files. The options so far, more or less implemented (a hypothetical invocation pulling several of these together is sketched right after the list):

  • Able to accept/detect compressed files as input; also able to tell CSV what compression to expect
  • Accept URLs as input (this may require some upstream work as Requests.jl sucks hardcore right now for downloading, but maybe we can just use Base.download) [this might be something to revisit, but right now, we support CSV.getfield(io::IOBuffer, ::Type{T}), which would allow for fairly seamless streaming code]
  • Ability to specify an arbitrary ASCII delimiter
  • Ability to specify an arbitrary ASCII newline character; not sure what to do about CRLF (\r\n) [we're just going to accept \r, \n, and \r\n and handle those three automatically]
  • Ability to specify a quote character that quotes field values and allows delimiters/newlines in field values
  • Ability to specify an escape character that allows for the quote character inside a quoted field value
  • Ability to provide a custom header of column names
  • Ability to specify a custom line where the column names can be found in the file; the data must start on the line following the column names; not sure what to do about headerless CSV files (if there is such a thing)
  • Ability to specify the types CSV should expect for each column individually or for every column
  • Ability to specify the date format of a file
  • Ability to tell CSV to use 'x' number of rows for type inference
  • Ability to specify a thousands separator character
  • Ability to specify a decimal character (e.g. ',' for Europe); not sure how to handle the implementation here
  • Ability to specify a custom NULL value to expect (e.g. "NA", "\N", "NULL", etc.)
  • Ability to skip blank lines
  • Ability to do "SKIP, LIMIT" type SQL functionality when parsing (i.e. read only a chunk of a file)
  • Ability to not parse specified columns (ignore them)
  • Ability to specify a # of lines to skip at the end of a file
  • Right now, we skip leading whitespace for numeric field parsing, but we don't ignore trailing whitespace
  • Ability to parse DateTime values
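
Pulling several of the proposed options together, a hypothetical invocation could look like the sketch below. The delim, quotechar, escapechar, types, null, and rows_for_type_detect keyword names are purely illustrative assumptions (only header, datarow, and dateformat come up elsewhere in this thread); none of this is a committed API:

    using CSV, Dates   # Dates supplies the Date type used in the column-type override

    df = CSV.read("data.csv";
        delim = ';',                      # arbitrary ASCII delimiter
        quotechar = '"',                  # character that quotes field values
        escapechar = '\\',                # escape for the quote character inside quoted fields
        header = ["id", "name", "date"],  # custom column names
        datarow = 2,                      # line on which the data starts
        types = Dict(3 => Date),          # per-column type override: column 3 is a Date
        dateformat = "yyyy-mm-dd",        # date format for the whole file
        null = "NA",                      # custom NULL sentinel
        rows_for_type_detect = 100)       # rows to use for type inference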

This is of course less feature-rich than pandas or data.table's fread, but I also had an epiphany of sorts the other day with regard to bazillion-feature CSV readers. They have to provide so many features because their languages suck. Think about it: pandas needs to provide all these crazy options and parsing-function capabilities because otherwise you'd have to do additional processing in Python, which defeats the purpose of using a nice C pandas implementation. Same with R, to some extent.

For CSV, I want to take the approach that if a certain feature can be done post-parsing as efficiently as we'd be able to do it while parsing, then we shouldn't support it. Julia is great and fast; don't be afraid of processing your ugly, misshapen CSV files. We want this implementation to be fast and simple, with no need to clutter it with extraneous features. Sure, we can provide things that are convenient here and there, but I really don't think we need to go overboard.

@johnmyleswhite @davidagold @jiahao @RaviMohan @StefanKarpinski

@ScottPJones

First off, make sure you handle RFC 4180 correctly (which means CRLF must be handled, that is part of the standard).

@ScottPJones

Able to accept/detect compressed files as input; also able to tell CSV what compression to expect
Accept URLs as input (this may require some upstream work as Requests.jl sucks hardcore right now for downloading, but maybe we can just use Base.download)

Aren't these two things really something that could be done outside of the CSV parser?

Ability to specify an arbitrary ASCII delimiter

Tabs are very frequently used; also, M$ uses ';' when ',' is used as the decimal separator, AFAIK.

Ability to specify an arbitrary ASCII newline character; not sure what to do about CRLF (\r\n)

Is this really necessary? I think all you really need to do is treat \r, \n, and \r\n all as newlines,
and not complicate things by allowing an arbitrary character.
Have you seen anything other than those three formats (old Mac OS, Unix/Linux, DOS/Windows, etc.)?
The one thing you might want to be able to deal with is an array of strings, but that should be a separate entry point into the parser, IMO.
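
For the record, the "treat \r, \n, and \r\n all as newlines" rule is simple enough to sketch; this is a standalone helper under that assumption, not CSV.jl's actual internals:

    # Advance past one line terminator starting at byte position i.
    # \n, \r, and \r\n each count as a single newline (RFC 4180 mandates CRLF).
    function skip_newline(bytes::Vector{UInt8}, i::Int)
        if bytes[i] == UInt8('\n')
            return i + 1
        elseif bytes[i] == UInt8('\r')
            # swallow a following \n so that CRLF counts as one terminator
            return (i < length(bytes) && bytes[i + 1] == UInt8('\n')) ? i + 2 : i + 1
        end
        return i   # not positioned at a newline
    end

    skip_newline(Vector{UInt8}("a\r\nb"), 2)   # returns 4, the index of 'b'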

Ability to specify a quote character that quotes field values and allows delimiters/newlines in field values
Ability to specify an escape character that allows for the quote character inside a quoted field value

Are those necessary? RFC 4180 handles those situations just fine: "" is used to escape a double quote,
and a value enclosed in " is allowed to span multiple lines.
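
For reference, RFC 4180-style quoting looks like the sample below (this is illustrative data, not anything CSV.jl emits): a doubled "" inside a quoted field is a literal double quote, and a quoted field may contain delimiters and newlines.

    csv_text = """
    name,comment
    "Smith, John","He said ""hello""
    and left"
    """
    # The comment field of the data row parses to: He said "hello"\nand left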

Ability to provide a custom header of column names
Ability to specify a custom line where the column names can be found in the file; the data must start on the line following the column names;

Sounds fine.

not sure what to do about headerless CSV files (if there is such a thing)

Actually, "headerless" CSV files are a very frequent case (more frequent in my experience than
CSV files with headers). RFC 4180 uses the MIME type to determine that, if you have a MIME type.

Ability to specify the types CSV should expect for each column individually or for every column
Ability to specify the formats of Date/DateTime columns for parsing
Ability to tell CSV to use 'x' number of rows for type inference

All nice features.

Ability to specify a thousands separator character
Ability to specify a decimal character (e.g. ',' for Europe); not sure how to handle the implementation here

Those can frequently be auto-detected: either the numbers will be enclosed in double quotes if they have ',' in them, or they follow the M$ convention (which I haven't actually seen in the wild) of using ';' for the separator.
If you are trying to infer types from the data, then you have to be careful here, because just because something is within quotes doesn't mean it isn't a numeric field.
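
If the decimal/thousands handling ends up living outside the parser, the post-processing is straightforward; a minimal sketch assuming '.' is the thousands separator and ',' the decimal character (the helper name is made up):

    # Convert a European-style number such as "123.456,30" to a Float64.
    function parse_localized_float(s::AbstractString; thousands::Char = '.', decimal::Char = ',')
        cleaned = replace(s, string(thousands) => "")       # drop thousands separators
        cleaned = replace(cleaned, string(decimal) => ".")  # normalize the decimal mark
        return parse(Float64, cleaned)
    end

    parse_localized_float("123.456,30")   # 123456.3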

Ability to specify a custom NULL value to expect (e.g. "NA", "\N", "NULL", etc.)

It may simply be nothing between the commas, or a comma at the end of the line; that's what I've usually seen. Remember, strings might not be within quotes, just as numbers might be in quotes,
so you might have foo,null,bar,"123.456,30".

Ability to skip blank lines
Ability to do "SKIP, LIMIT" type SQL functionality when parsing (i.e. read only a chunk of a file)
Ability to not parse specified columns (ignore them)

Those 3 are nice.

Ability to specify a # of lines to skip at the end of a file

What would that be for? That seems like it would be hard to figure out, you'd have to load the data, and then throw out the last # lines, right?

@StefanKarpinski
Member

I don't think that plain strings should be treated as URLs that will be downloaded. I think that only a URL type should work that way. Otherwise it's way too easy for someone to trick you into making arbitrary HTTP / whatever protocol requests.

@StefanKarpinski
Member

This also plays well with multiple dispatch – instead of adding URL handling logic to the main body, you can just add methods for the URL type and then call the normal method from there.
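
The dispatch version is roughly the sketch below; the URL wrapper type and the read_csv name are hypothetical, and only Base.download and CSV.read come from this thread:

    using CSV

    struct URL                 # explicit wrapper so plain strings are never auto-downloaded
        value::String
    end

    read_csv(path::AbstractString; kwargs...) = CSV.read(path; kwargs...)

    # URL method: fetch to a temporary file, then fall through to the normal path.
    function read_csv(u::URL; kwargs...)
        path = Base.download(u.value)
        return read_csv(path; kwargs...)
    end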

@ScottPJones

This also plays well with multiple dispatch

Ah, the awesome beauty of Julia! 😀

@jiahao

jiahao commented Jul 15, 2015

I've also encountered some CSVs where the header is split across two rows, and this confuses the heck out of all the other CSV readers I've found. It would be good to support multirow headers.

@quinnj
Member Author

quinnj commented Jul 15, 2015

Really, multi-row headers, huh? I've never heard of that before. Can you share an example? It's not immediately obvious how we would handle that in the general case (a 100-row header?).

@jiahao

jiahao commented Jul 15, 2015

Here is a small snippet of a medical data set, called MIMIC II. A lot of the data are specified in plaintext format, like so:

     Time          MCL1       I
(hh:mm:ss.mmm)     (mV)    (mV)
[21:49:14.504]   -0.069  -0.072
[21:49:14.512]   -0.138  -0.040
[21:49:14.520]   -0.207   0.136
...

Not really a CSV per se, but the only reader I've found that can handle these data well is Excel...

@quinnj
Member Author

quinnj commented Jul 15, 2015

Ah... I see. No, I've definitely seen these before. The general thing I want to do in these cases is vcat the "header rows" into a single column name (see the sketch after this list), i.e.:

  • Time_(hh:mm:ss.mmm)
  • MCL1_(mV)
  • I_(mV)
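
A standalone sketch of that concatenation (not CSV.jl internals): given the header rows already split into fields, join them column-wise with an underscore.

    # ["Time", "MCL1", "I"] and ["(hh:mm:ss.mmm)", "(mV)", "(mV)"]
    # become ["Time_(hh:mm:ss.mmm)", "MCL1_(mV)", "I_(mV)"]
    function merge_header_rows(rows::Vector{Vector{String}}; sep::AbstractString = "_")
        return [join(getindex.(rows, i), sep) for i in 1:length(rows[1])]
    end

    merge_header_rows([["Time", "MCL1", "I"], ["(hh:mm:ss.mmm)", "(mV)", "(mV)"]])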

@jiahao

jiahao commented Jul 15, 2015

Concatenating the header rows would be fine.

@jiahao

jiahao commented Jul 15, 2015

It would also be nice to be able to handle fields with \0 characters. For some reason this seems to be rather common in CSV files in the wild. pandas and fread both choke in the presence of \0s.

@quinnj
Member Author

quinnj commented Jul 15, 2015

The parser, as it is in this package, handles \0 embedded in strings. Do you think there are cases where they're embedded in other types?

@jiahao

jiahao commented Jul 15, 2015

Not that I've encountered. I've only seen \0 in string fields.

@ScottPJones

Yes, you really need to handle \0 in quoted strings at the very least (within quoted strings, you really need to be able to handle absolutely any character).
Have you looked at handling common Excel weirdnesses, like ="000123"?
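
That particular Excel pattern is easy to undo after the fact; a minimal sketch (the helper name is made up):

    # Strip Excel's ="..." wrapper that forces text interpretation, e.g. ="000123" -> "000123".
    function strip_excel_text_formula(s::AbstractString)
        return (startswith(s, "=\"") && endswith(s, "\"")) ? s[3:end-1] : s
    end

    strip_excel_text_formula("=\"000123\"")   # "000123"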

@johnmyleswhite

I think you'll ultimately need to support multicharacter end-of-column and end-of-row delimiters. Pandas does this: they often demonstrate parsing the MovieLens data sets, where :: is the end-of-column delimiter.
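
As a stopgap outside the parser, a multicharacter delimiter can always be handled with a plain split; a minimal sketch on a MovieLens-style line (the sample line is illustrative):

    line = "1::Toy Story (1995)::Animation|Children's|Comedy"
    fields = split(line, "::")
    # 3-element vector: "1", "Toy Story (1995)", "Animation|Children's|Comedy"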

@ScottPJones

@johnmyleswhite That's a strange one! What other oddities have you come across that should be handled?
Thinking of the discussion in #3, couldn't all of that be handled efficiently by generating a specialized reader for that case, so that handling multicharacter end-of-column and end-of-row delimiters would not slow down the 99% case of single-character delimiters?

@quinnj quinnj changed the title CSV.File/CSV.read initial supported feature set Requested CSV Parsing Features [comment here for new requests] Oct 23, 2015
@felipenoris
Contributor

How about exporting variables to CSV files?

@quinnj
Member Author

quinnj commented Oct 29, 2015

What do you mean exactly, @felipenoris?

@felipenoris
Contributor

@quinnj , I mean save vector or matrix to file. Equivalent to writetable("output.csv", df) on DataFrames.
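
Until CSV.jl has a writer, the plain vector/matrix case is already covered by writedlm with a comma delimiter; a minimal sketch (writedlm was in Base when this thread was written and later moved to the DelimitedFiles standard library):

    using DelimitedFiles   # not needed on the Julia versions contemporary with this thread

    A = [1 2.5 "x"; 3 4.5 "y"]
    writedlm("output.csv", A, ',')   # comma-delimited text output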

@AndyGreenwell

From the first entry in this thread:

not sure what to do about headerless CSV files (if there is such a thing)

I'm attempting to load a couple of large CSV files for which no header row was included.

With DataFrames.jl, providing header=false just treats all rows as data and provides generic column names of x1, x2, x3, etc.

Any possibility of the same behavior here for "headerless" CSV files?

@quinnj
Member Author

quinnj commented Jul 13, 2016

Yep, this is supported. Just provide the header manually and set the datarow, like CSV.read(file; header=["col1","col2","col3"], datarow=1)

@JeffBezanson

Feature request: be able to specify a custom parsing function for a certain column. I give you a function, and the type it will return, and you can pass the function a string to parse. For example, the dateformat=fmt option could then be implemented by passing e.g. parsers = (2 => s->Date(s,fmt),), where column 2 has type Date. Could be spelled many ways.
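
A standalone illustration of the idea on already-split fields (the parsers spelling is the proposal above, not an existing CSV.jl option):

    using Dates

    # Apply per-column parser functions to one row of string fields;
    # columns without an entry in `parsers` stay as strings.
    function parse_row(fields::Vector{String}, parsers::Dict{Int,Function})
        return [haskey(parsers, i) ? parsers[i](fields[i]) : fields[i]
                for i in eachindex(fields)]
    end

    fmt = dateformat"yyyy-mm-dd"
    parse_row(["a", "2015-07-14", "1.5"],
              Dict{Int,Function}(2 => s -> Date(s, fmt), 3 => s -> parse(Float64, s)))
    # Any["a", Date(2015, 7, 14), 1.5]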

@JeffBezanson

Refactoring request: remove the dependency on DataFrames and (less importantly) NullableArrays.

The coupling to DataFrames already seems to be very light, which is great. AFAICT, the core functionality already just inserts values into a vector of column vectors. The only issue is that DataFrames is a pretty large dependency.

I also think missing values should only be handled for columns of type Nullable{T}. If Nullable is requested or inferred as a column type, then a NullableArray can be used for that column; I'm fine with that. Ideally, you could also stream! into e.g. an Array{Nullable} if you wanted, though.

@quinnj
Member Author

quinnj commented Aug 12, 2016

Like current master? :)

The newest updates (project DECOUPLE) make CSV.jl basically oblivious to where it's sending data. It provides a Data.getfield{T}(source::CSV.Source, ::Type{T}, row, col) => Nullable{T} method that can be used by any Sink type. The DataFrames code that ingests a CSV.Source is now at https://github.com/JuliaData/DataStreams.jl/blob/a7246389c07df6ee22ebb10c6fd221743bd68b89/src/DataStreams.jl#L260.

The idea is to make it as easy as possible for anyone to write their own "Sink" type that could ingest from a CSV.Source (and still use the convenience functions like CSV.read(source, sink)).
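
Going off just the Data.getfield method quoted above, a toy sink could materialize all columns roughly as follows; the row count and column types are passed in explicitly so that nothing else about the DataStreams API is assumed (this targets the era's CSV/DataStreams packages, not the modern ones):

    using CSV, DataStreams

    # Pull every cell of a CSV.Source through the streaming interface, column by column.
    function materialize(source, rows::Integer, types::Vector{DataType})
        return [[Data.getfield(source, T, row, col) for row in 1:rows]
                for (col, T) in enumerate(types)]
    end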

@davidagold

davidagold commented Aug 12, 2016

using DataStreams, DataFrames, WeakRefStrings
Isn't this an implicit dependency on DataFrames? Anyway DataStreams requires DataFrames, and CSV requires DataStreams...

EDIT: DataFrames => DataStreams. Too much Data.

@JeffBezanson

@quinnj That's great! I'll have to try to write a Sink.

CSV does still do using DataFrames though, so it should probably be added back to REQUIRES.

Anyway this isn't super urgent since you can get the data from a DataFrame without copying.

@quinnj
Member Author

quinnj commented Aug 12, 2016

@davidagold is right, DataStreams now has the explicit dependency on DataFrames.

@davidagold

@quinnj Would you support moving the DataFrame sink methods to DataFrames?

@quinnj
Member Author

quinnj commented Aug 12, 2016

Yeah, it's on my todo list, after I clean up SQLite & ODBC.

@quinnj
Member Author

quinnj commented Aug 16, 2016

BTW @JeffBezanson, the new docs on the Source/Sink interface are here; I'm still trying to figure out why they're not hosting correctly.

@quinnj
Member Author

quinnj commented Oct 12, 2016

Closing in favor of specific issues going forward.
