
Requested CSV Parsing Features [comment here for new requests] #3

Closed
15 of 20 tasks
quinnj opened this issue Jul 14, 2015 · 31 comments

Comments

@quinnj
Member

quinnj commented Jul 14, 2015

This issue is for listing out and refining the desired functionality for working with CSV files. The options so far, more or less implemented (a hypothetical invocation pulling several of these together is sketched right after the list):

  • Able to accept/detect compressed files as input; also able to tell CSV what compression to expect
  • Accept URLs as input (this may require some upstream work as Requests.jl sucks hardcore right now for downloading, but maybe we can just use Base.download) [this might be something to revisit, but right now, we support CSV.getfield(io::IOBuffer, ::Type{T}), which would allow for fairly seamless streaming code]
  • Ability to specify an arbitrary ASCII delimiter
  • Ability to specify an arbitrary ASCII newline character; not sure what to do about CRLF (\r\n) [we're just going to accept \r, \n, and \r\n and handle those three automatically]
  • Ability to specify a quote character that quotes field values and allows delimiters/newlines in field values
  • Ability to specify an escape character that allows for the quote character inside a quoted field value
  • Ability to provide a custom header of column names
  • Ability to specify a custom line where the column names can be found in the file; the data must start on the line following the column names; not sure what to do about headerless CSV files (if there is such a thing)
  • Ability to specify the types CSV should expect for each column individually or for every column
  • Ability to specify the date format of a file
  • Ability to tell CSV to use 'x' number of rows for type inference
  • Ability to specify a thousands separator character
  • Ability to specify a decimal character (e.g. ',' for Europe); not sure how to handle the implementation here
  • Ability to specify a custom NULL value to expect (e.g. "NA", "\N", "NULL", etc.)
  • Ability to skip blank lines
  • Ability to do "SKIP, LIMIT" type SQL functionality when parsing (i.e. read only a chunk of a file)
  • Ability to not parse specified columns (ignore them)
  • Ability to specify a # of lines to skip at the end of a file
  • Right now, we skip leading whitespace for numeric field parsing, but we don't ignore trailing whitespace
  • Ability to parse DateTime values
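
Pulling several of the proposed options together, a hypothetical invocation could look like the sketch below. The delim, quotechar, escapechar, types, null, and rows_for_type_detect keyword names are purely illustrative assumptions (only header, datarow, and dateformat come up elsewhere in this thread); none of this is a committed API:

    using CSV, Dates   # Dates supplies the Date type used in the column-type override

    df = CSV.read("data.csv";
        delim = ';',                      # arbitrary ASCII delimiter
        quotechar = '"',                  # character that quotes field values
        escapechar = '\\',                # escape for the quote character inside quoted fields
        header = ["id", "name", "date"],  # custom column names
        datarow = 2,                      # line on which the data starts
        types = Dict(3 => Date),          # per-column type override: column 3 is a Date
        dateformat = "yyyy-mm-dd",        # date format for the whole file
        null = "NA",                      # custom NULL sentinel
        rows_for_type_detect = 100)       # rows to use for type inference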

This is of course less feature-rich than pandas or data.table's fread, but I also had an epiphany of sorts the other day with regard to bazillion-feature CSV readers. They have to provide so many features because their languages suck. Think about it: pandas needs to provide all these crazy options and parsing-function capabilities because otherwise you'd have to do additional processing in Python, which defeats the purpose of using a nice C pandas implementation. Same with R, to some extent.

For CSV, I want to take the approach that if a certain feature can be done post-parsing as efficiently as we'd be able to do it while parsing, then we shouldn't support it. Julia is great and fast; don't be afraid of processing your ugly, misshapen CSV files. We want this implementation to be fast and simple, with no need to clutter it with extraneous features. Sure, we can provide things that are convenient here and there, but I really don't think we need to go overboard.

@johnmyleswhite @davidagold @jiahao @RaviMohan @StefanKarpinski

@ScottPJones

First off, make sure you handle RFC 4180 correctly (which means CRLF must be handled, that is part of the standard).

@ScottPJones

Able to accept/detect compressed files as input; also able to tell CSV what compression to expect
Accept URLs as input (this may require some upstream work as Requests.jl sucks hardcore right now for downloading, but maybe we can just use Base.download)

Aren't these two things really something that could be done outside of the CSV parser?

Ability to specify an arbitrary ASCII delimiter

Tabs are very frequently used; also, M$ uses ';' when ',' is used as the decimal separator, AFAIK.

Ability to specify an arbitrary ASCII newline character; not sure what to do about CRLF (\r\n)

Is this really necessary? I think all you really need to do is treat \r, \n, and \r\n all as newlines,
and not complicate things by allowing an arbitrary character.
Have you seen anything other than those three formats (old Mac OS, Unix/Linux, DOS/Windows, etc.)?
The one thing you might want to be able to deal with is an array of strings, but that should be a separate entry point into the parser, IMO.
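
For the record, the "treat \r, \n, and \r\n all as newlines" rule is simple enough to sketch; this is a standalone helper under that assumption, not CSV.jl's actual internals:

    # Advance past one line terminator starting at byte position i.
    # \n, \r, and \r\n each count as a single newline (RFC 4180 mandates CRLF).
    function skip_newline(bytes::Vector{UInt8}, i::Int)
        if bytes[i] == UInt8('\n')
            return i + 1
        elseif bytes[i] == UInt8('\r')
            # swallow a following \n so that CRLF counts as one terminator
            return (i < length(bytes) && bytes[i + 1] == UInt8('\n')) ? i + 2 : i + 1
        end
        return i   # not positioned at a newline
    end

    skip_newline(Vector{UInt8}("a\r\nb"), 2)   # returns 4, the index of 'b'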

Ability to specify a quote character that quotes field values and allows delimiters/newlines in field values
Ability to specify an escape character that allows for the quote character inside a quoted field value

Are those necessary? RFC 4180 handles those situations just fine: "" is used to escape a double quote,
and a value enclosed in " is allowed to span multiple lines.
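
For reference, RFC 4180-style quoting looks like the sample below (this is illustrative data, not anything CSV.jl emits): a doubled "" inside a quoted field is a literal double quote, and a quoted field may contain delimiters and newlines.

    csv_text = """
    name,comment
    "Smith, John","He said ""hello""
    and left"
    """
    # The comment field of the data row parses to: He said "hello"\nand left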

Ability to provide a custom header of column names
Ability to specify a custom line where the column names can be found in the file; the data must start on the line following the column names;

Sounds fine.

not sure what to do about headerless CSV files (if there is such a thing)

Actually, "headerless" CSV files are a very frequent case (more frequent in my experience than
CSV files with headers). RFC 4180 uses the MIME type to determine that, if you have a MIME type.

Ability to specify the types CSV should expect for each column individually or for every column
Ability to specify the formats of Date/DateTime columns for parsing
Ability to tell CSV to use 'x' number of rows for type inference

All nice features.

Ability to specify a thousands separator character
Ability to specify a decimal character (e.g. ',' for Europe); not sure how to handle the implementation here

Those can frequently be auto-detected: either the numbers will be enclosed in double quotes if they have ',' in them, or they follow the M$ convention (which I haven't actually seen in the wild) of using ';' for the separator.
If you are trying to infer types from the data, then you have to be careful here, because just because something is within quotes doesn't mean it isn't a numeric field.
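
If the decimal/thousands handling ends up living outside the parser, the post-processing is straightforward; a minimal sketch assuming '.' is the thousands separator and ',' the decimal character (the helper name is made up):

    # Convert a European-style number such as "123.456,30" to a Float64.
    function parse_localized_float(s::AbstractString; thousands::Char = '.', decimal::Char = ',')
        cleaned = replace(s, string(thousands) => "")       # drop thousands separators
        cleaned = replace(cleaned, string(decimal) => ".")  # normalize the decimal mark
        return parse(Float64, cleaned)
    end

    parse_localized_float("123.456,30")   # 123456.3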

Ability to specify a custom NULL value to expect (e.g. "NA", "\N", "NULL", etc.)

It may simply be nothing between the commas, or a comma at the end of the line; that's what I've usually seen. Remember, strings might not be within quotes, just as numbers might be in quotes,
so you might have foo,null,bar,"123.456,30".

Ability to skip blank lines
Ability to do "SKIP, LIMIT" type SQL functionality when parsing (i.e. read only a chunk of a file)
Ability to not parse specified columns (ignore them)

Those 3 are nice.

Ability to specify a # of lines to skip at the end of a file

What would that be for? That seems like it would be hard to figure out, you'd have to load the data, and then throw out the last # lines, right?

@StefanKarpinski
Member

I don't think that plain strings should be treated as URLs that will be downloaded. I think that only a URL type should work that way. Otherwise it's way too easy for someone to trick you into making arbitrary HTTP / whatever protocol requests.

@StefanKarpinski
Member

This also plays well with multiple dispatch – instead of adding URL handling logic to the main body, you can just add methods for the URL type and then call the normal method from there.
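
The dispatch version is roughly the sketch below; the URL wrapper type and the read_csv name are hypothetical, and only Base.download and CSV.read come from this thread:

    using CSV

    struct URL                 # explicit wrapper so plain strings are never auto-downloaded
        value::String
    end

    read_csv(path::AbstractString; kwargs...) = CSV.read(path; kwargs...)

    # URL method: fetch to a temporary file, then fall through to the normal path.
    function read_csv(u::URL; kwargs...)
        path = Base.download(u.value)
        return read_csv(path; kwargs...)
    end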

@ScottPJones

This also plays well with multiple dispatch

Ah, the awesome beauty of Julia! 😀

@jiahao

jiahao commented Jul 15, 2015

I've also encountered some CSVs where the header is split across two rows, and this confuses the heck out of all the other CSV readers I've found. It would be good to support multirow headers.

@quinnj
Member Author

quinnj commented Jul 15, 2015

Really, multi-row headers, huh? I've never heard of that before. Can you share an example? It's not immediately obvious how we would handle that in the general case (a 100-row header?).

@jiahao

jiahao commented Jul 15, 2015

Here is a small snippet of a medical data set, called MIMIC II. A lot of the data are specified in plaintext format, like so:

     Time          MCL1       I
(hh:mm:ss.mmm)     (mV)    (mV)
[21:49:14.504]   -0.069  -0.072
[21:49:14.512]   -0.138  -0.040
[21:49:14.520]   -0.207   0.136
...

Not really a CSV per se, but the only reader I've found that can handle these data well is Excel...

@quinnj
Member Author

quinnj commented Jul 15, 2015

Ah... I see. No, I've definitely seen these before. The general thing I want to do in these cases is vcat the "header rows" into a single column name (see the sketch after this list), i.e.:

  • Time_(hh:mm:ss.mmm)
  • MCL1_(mV)
  • I_(mV)
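
A standalone sketch of that concatenation (not CSV.jl internals): given the header rows already split into fields, join them column-wise with an underscore.

    # ["Time", "MCL1", "I"] and ["(hh:mm:ss.mmm)", "(mV)", "(mV)"]
    # become ["Time_(hh:mm:ss.mmm)", "MCL1_(mV)", "I_(mV)"]
    function merge_header_rows(rows::Vector{Vector{String}}; sep::AbstractString = "_")
        return [join(getindex.(rows, i), sep) for i in 1:length(rows[1])]
    end

    merge_header_rows([["Time", "MCL1", "I"], ["(hh:mm:ss.mmm)", "(mV)", "(mV)"]])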

@jiahao

jiahao commented Jul 15, 2015

Concatenating the header rows would be fine.

@jiahao

jiahao commented Jul 15, 2015

It would also be nice to be able to handle fields with \0 characters. For some reason this seems to be rather common in CSV files in the wild. pandas and fread both choke in the presence of \0s.

@quinnj
Member Author

quinnj commented Jul 15, 2015

The parser, as it is in this package, handles \0 embedded in strings. Do you think there are cases where they're embedded in other types?

@jiahao

jiahao commented Jul 15, 2015

Not that I've encountered. I've only seen \0 in string fields.

@ScottPJones

Yes, you really need to handle \0 in quoted strings at the very least (within quoted strings, you really need to be able to handle absolutely any character).
Have you looked at handling common Excel weirdnesses, like ="000123"?
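
That particular Excel pattern is easy to undo after the fact; a minimal sketch (the helper name is made up):

    # Strip Excel's ="..." wrapper that forces text interpretation, e.g. ="000123" -> "000123".
    function strip_excel_text_formula(s::AbstractString)
        return (startswith(s, "=\"") && endswith(s, "\"")) ? s[3:end-1] : s
    end

    strip_excel_text_formula("=\"000123\"")   # "000123"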

@johnmyleswhite

I think you'll ultimately need to support multicharacter end-of-column and end-of-row delimiters. Pandas does this: they often demonstrate parsing the MovieLens data sets, where :: is the end-of-column delimiter.
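
As a stopgap outside the parser, a multicharacter delimiter can always be handled with a plain split; a minimal sketch on a MovieLens-style line (the sample line is illustrative):

    line = "1::Toy Story (1995)::Animation|Children's|Comedy"
    fields = split(line, "::")
    # 3-element vector: "1", "Toy Story (1995)", "Animation|Children's|Comedy"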

@ScottPJones

@johnmyleswhite That's a strange one! What other oddities have you come across that should be handled?
Thinking of the discussion in #3, couldn't all of that be handled efficiently by generating a specialized reader for that case, so that handling multicharacter end-of-column and end-of-row delimiters would not slow down the 99% case of single-character delimiters?

@quinnj quinnj changed the title CSV.File/CSV.read initial supported feature set Requested CSV Parsing Features [comment here for new requests] Oct 23, 2015
@felipenoris
Contributor

How about exporting variables to CSV files?

@quinnj
Member Author

quinnj commented Oct 29, 2015

What do you mean exactly, @felipenoris?

@felipenoris
Contributor

@quinnj , I mean save vector or matrix to file. Equivalent to writetable("output.csv", df) on DataFrames.
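
Until CSV.jl has a writer, the plain vector/matrix case is already covered by writedlm with a comma delimiter; a minimal sketch (writedlm was in Base when this thread was written and later moved to the DelimitedFiles standard library):

    using DelimitedFiles   # not needed on the Julia versions contemporary with this thread

    A = [1 2.5 "x"; 3 4.5 "y"]
    writedlm("output.csv", A, ',')   # comma-delimited text output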

@AndyGreenwell

From the first entry in this thread:

not sure what to do about headerless CSV files (if there is such a thing)

I'm attempting to load a couple of large CSV files for which no header row was included.

With DataFrames.jl, providing header=false just treats all rows as data and provides generic column names of x1, x2, x3, etc.

Any possibility of the same behavior here for "headerless" CSV files?

@quinnj
Member Author

quinnj commented Jul 13, 2016

Yep, this is supported. Just provide the header manually and set the datarow, like CSV.read(file; header=["col1","col2","col3"], datarow=1)

@JeffBezanson

Feature request: be able to specify a custom parsing function for a certain column. I give you a function, and the type it will return, and you can pass the function a string to parse. For example, the dateformat=fmt option could then be implemented by passing e.g. parsers = (2 => s->Date(s,fmt),), where column 2 has type Date. Could be spelled many ways.
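
A standalone illustration of the idea on already-split fields (the parsers spelling is the proposal above, not an existing CSV.jl option):

    using Dates

    # Apply per-column parser functions to one row of string fields;
    # columns without an entry in `parsers` stay as strings.
    function parse_row(fields::Vector{String}, parsers::Dict{Int,Function})
        return [haskey(parsers, i) ? parsers[i](fields[i]) : fields[i]
                for i in eachindex(fields)]
    end

    fmt = dateformat"yyyy-mm-dd"
    parse_row(["a", "2015-07-14", "1.5"],
              Dict{Int,Function}(2 => s -> Date(s, fmt), 3 => s -> parse(Float64, s)))
    # Any["a", Date(2015, 7, 14), 1.5]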

@JeffBezanson

Refactoring request: remove the dependency on DataFrames and (less importantly) NullableArrays.

The coupling to DataFrames already seems to be very light, which is great. AFAICT, the core functionality already just inserts values into a vector of column vectors. The only issue is that DataFrames is a pretty large dependency.

I also think missing values should only be handled for columns of type Nullable{T}. If Nullable is requested or inferred as a column type, then a NullableArray can be used for that column; I'm fine with that. Ideally, you could also stream! into e.g. an Array{Nullable} if you wanted, though.

@quinnj
Member Author

quinnj commented Aug 12, 2016

Like current master? :)

The newest updates (project DECOUPLE) make CSV.jl basically oblivious to where it's sending data. It provides a Data.getfield{T}(source::CSV.Source, ::Type{T}, row, col) => Nullable{T} method that can be used by any Sink type. The DataFrames code that ingests a CSV.Source is now at https://github.com/JuliaData/DataStreams.jl/blob/a7246389c07df6ee22ebb10c6fd221743bd68b89/src/DataStreams.jl#L260.

The idea is to make it as easy as possible for anyone to write their own "Sink" type that could ingest from a CSV.Source (and still use the convenience functions like CSV.read(source, sink)).
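
Going off just the Data.getfield method quoted above, a toy sink could materialize all columns roughly as follows; the row count and column types are passed in explicitly so that nothing else about the DataStreams API is assumed (this targets the era's CSV/DataStreams packages, not the modern ones):

    using CSV, DataStreams

    # Pull every cell of a CSV.Source through the streaming interface, column by column.
    function materialize(source, rows::Integer, types::Vector{DataType})
        return [[Data.getfield(source, T, row, col) for row in 1:rows]
                for (col, T) in enumerate(types)]
    end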

@davidagold

davidagold commented Aug 12, 2016

using DataStreams, DataFrames, WeakRefStrings
Isn't this an implicit dependency on DataFrames? Anyway DataStreams requires DataFrames, and CSV requires DataStreams...

EDIT: DataFrames => DataStreams. Too much Data.

@JeffBezanson

@quinnj That's great! I'll have to try to write a Sink.

CSV does still do using DataFrames though, so it should probably be added back to REQUIRES.

Anyway this isn't super urgent since you can get the data from a DataFrame without copying.

@quinnj
Member Author

quinnj commented Aug 12, 2016

@davidagold is right, DataStreams now has the explicit dependency on DataFrames.

@davidagold

@quinnj Would you support moving the DataFrame sink methods to DataFrames?

@quinnj
Member Author

quinnj commented Aug 12, 2016

Yeah, it's on my todo list, after I clean up SQLite & ODBC.

@quinnj
Member Author

quinnj commented Aug 16, 2016

BTW @JeffBezanson, the new docs on the Source/Sink interface are here; I'm still trying to figure out why they're not hosting correctly.

@quinnj
Member Author

quinnj commented Oct 12, 2016

Closing in favor of specific issues going forward.
