Requested CSV Parsing Features [comment here for new requests] #3
Comments
First off, make sure you handle RFC 4180 correctly (which means CRLF must be handled; that is part of the standard).
Aren't these two things really something that could be done outside of the CSV parser?
Tabs are very frequently used; also, M$ uses
Is this really necessary? I think all you really need to do is treat:
Are those necessary? RFC 4180 handles those situations just fine,
Sounds fine.
Actually, "headerless" CSV files are a very frequent case (more frequent in my experience than
All nice features.
Those can frequently be auto-detected, either the numbers will be enclosed in double quotes if they have
There may be simply nothing between the commas, or a comma at the end of the line; that's what I've usually seen. Remember, strings might not be within quotes, just as numbers might be in quotes,
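As a quick illustration of the empty-field cases described above, splitting a line on commas yields empty strings both between adjacent commas and after a trailing comma (plain `split` semantics, not CSV.jl's actual parser):

```julia
# Missing values show up as empty strings: one between the commas,
# one after the trailing comma at the end of the line.
line = "1,,3,"
fields = split(line, ',')
# fields == ["1", "", "3", ""]
```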
Those 3 are nice.
What would that be for? That seems like it would be hard to figure out; you'd have to load the data, and then throw out the last
I don't think that plain strings should be treated as URLs that will be downloaded. I think that only a URL type should work that way. Otherwise it's way too easy for someone to trick you into making arbitrary HTTP / whatever-protocol requests.
This also plays well with multiple dispatch – instead of adding URL handling logic to the main body, you can just add methods for the URL type and then call the normal method from there.
Ah, the awesome beauty of Julia! 😀
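That dispatch pattern might look roughly like this (the `URL` wrapper type and `read_csv` names are hypothetical stand-ins, not CSV.jl's actual API):

```julia
# Hypothetical wrapper type: marks a string as a URL explicitly, so plain
# strings are never silently treated as something to download.
struct URL
    value::String
end

# Core method: parse CSV from a local file path (placeholder logic here).
read_csv(path::AbstractString) = "parsed $path"

# URL method: fetch first, then delegate to the normal path method.
function read_csv(u::URL)
    path = "local-copy.csv"   # in practice: Base.download(u.value)
    read_csv(path)
end
```

The core parser never needs to know URLs exist; the download logic lives entirely in the method for the new type.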
I've also encountered some CSVs where the header is split across two rows, and this confuses the heck out of all the other CSV readers I've found. It would be good to support multi-row headers.
Really, multi-row headers, huh? Never heard of that before. Can you share an example? It's not immediately obvious how we would handle that in the general case (100-row header?)
Here is a small snippet of a medical data set called MIMIC II. A lot of the data are specified in plaintext format, like so:

Time            MCL1    I
(hh:mm:ss.mmm)  (mV)    (mV)
[21:49:14.504]  -0.069  -0.072
[21:49:14.512]  -0.138  -0.040
[21:49:14.520]  -0.207   0.136
...

Not really a CSV per se, but the only reader I've found that can handle these data well is Excel...
Ah... I see. No, I've definitely seen these before. The general thing I want to do in these cases is vcat the "header rows" into a single column name, i.e.:
Concatenating the header rows would be fine.
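A minimal sketch of that concatenation, using the header rows from the MIMIC II snippet above (the underscore separator is just one possible choice):

```julia
# Two header rows, already split into fields.
row1 = ["Time", "MCL1", "I"]
row2 = ["(hh:mm:ss.mmm)", "(mV)", "(mV)"]

# Join the vertical pieces of each column into a single name.
headers = [string(a, "_", b) for (a, b) in zip(row1, row2)]
# headers == ["Time_(hh:mm:ss.mmm)", "MCL1_(mV)", "I_(mV)"]
```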
It would also be nice to be able to handle fields with
The parser as-is in this package handles
Not that I've encountered. I've only seen
Yes, you really need to handle
I think you'll ultimately need to support multicharacter end-of-column and end-of-row delimiters. Pandas does this: they often demonstrate parsing the MovieLens data sets, where
@johnmyleswhite That's a strange one! What other oddities have you come across that should be handled?
How about exporting variables to CSV files?
What do you mean exactly, @felipenoris?
@quinnj, I mean saving a vector or matrix to a file. Equivalent to
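For reference, base Julia already covers the simple export case via `writedlm` (in the `DelimitedFiles` stdlib on newer versions; older Julia also had a `writecsv` convenience in Base):

```julia
using DelimitedFiles  # stdlib

A = [1 2 3; 4 5 6]

# Write the matrix as comma-separated values to an in-memory buffer
# (pass a filename instead of an IOBuffer to write to disk).
io = IOBuffer()
writedlm(io, A, ',')
out = String(take!(io))
# out == "1,2,3\n4,5,6\n"
```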
From the first entry in this thread:
Attempting to load a couple of large CSV files now for which there was no header row included. With DataFrames.jl, providing
Any possibility of the same behavior here for "headerless" CSV files?
Yep, this is supported. Just provide the header manually and set the
Feature request: be able to specify a custom parsing function for a certain column. I give you a function and the type it will return, and you can pass the function a string to parse. For example, the
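Even without parser support, a custom column parser can be applied to the raw strings after reading; here is a sketch with made-up data (`parse_hhmm` is a hypothetical user function, not a CSV.jl API):

```julia
# Hypothetical user-supplied parser: "hh:mm" -> minutes since midnight.
function parse_hhmm(s::AbstractString)
    h, m = split(s, ':')
    parse(Int, h) * 60 + parse(Int, m)
end

# Raw strings as read from one column (made-up data).
raw_col = ["12:30", "01:45", "23:59"]

minutes = map(parse_hhmm, raw_col)
# minutes == [750, 105, 1439]
```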
Refactoring request: remove the dependency on DataFrames and (less importantly) NullableArrays. The coupling to DataFrames already seems to be very light, which is great. AFAICT, the core functionality already just inserts values into a vector of column vectors. The only issue is that DataFrames is a pretty large dependency. I also think missing values should only be handled for columns of type
Like current master? :) The newest updates (project DECOUPLE) make CSV.jl basically oblivious to where it's sending data. It provides a
The idea is to make it as easy as possible for anyone to write their own "Sink" type that could ingest from a CSV.Source (and still use the convenience functions like
EDIT: DataFrames => DataStreams. Too much Data.
@quinnj That's great! I'll have to try to write a Sink. CSV does still do
Anyway, this isn't super urgent since you can get the data from a DataFrame without copying.
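The Source/Sink decoupling being discussed might look roughly like this (all names here are hypothetical stand-ins; the real interface lives in DataStreams):

```julia
# Sketch of the idea: any Source can stream into any Sink through one
# generic function, so the parser stays oblivious to the destination type.
abstract type Source end
abstract type Sink end

struct StringSource <: Source   # stand-in for something like CSV.Source
    rows::Vector{String}
end

struct CollectSink <: Sink      # a trivial user-defined Sink
    out::Vector{String}
end

# Generic transfer: a new Sink type only needs a method like this one.
function stream!(src::StringSource, snk::CollectSink)
    append!(snk.out, src.rows)
    snk
end
```

A user-defined Sink then just implements the transfer method for its own type and gets the existing Sources (and convenience functions) for free.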
@davidagold is right, DataStreams now has the explicit dependency on DataFrames.
@quinnj Would you support moving the
Yeah, it's on my todo list, after I clean up SQLite & ODBC.
BTW @JeffBezanson, the new docs on the
Closing in favor of specific issues going forward. |
This issue is for listing out and refining the desired functionality of working with CSV files. The options so far, more or less implemented:

- Accept URLs as input (this may require some upstream work as Requests.jl sucks hardcore right now for downloading, but maybe we can just use `Base.download`) [this might be something to revisit, but right now, we support `CSV.getfield(io::IOBuffer, ::Type{T})`, which would allow for fairly seamless streaming code]
- Ability to specify an arbitrary ASCII newline character; not sure what to do about CRLF (`\r\n`) [we're just going to accept `\r`, `\n`, and `\r\n` and handle those three automatically]
- Ability to specify the types `CSV` should expect for each column individually or for every column

This is of course less feature-rich than pandas or data.table's fread, but I also had an epiphany of sorts the other day with regard to bazillion-feature CSV readers. They have to provide so many features because their languages suck. Think about it: pandas needs to provide all these crazy options and parsing-function capabilities because otherwise you'd have to do additional processing in Python, which kills the purpose of using a nice C pandas implementation. Same with R to some extent.
For CSV, I want to take the approach that if a certain feature can be done post parsing as efficiently as we'd be able to do it while parsing, then we shouldn't support it. Julia is great and fast, don't be afraid of processing your ugly, misshapen CSV files. We want this implementation to be fast and simple, no need to clutter with extraneous features. Sure we can provide stuff that is convenient for this or that, but I really don't think we need to go overboard.
@johnmyleswhite @davidagold @jiahao @RaviMohan @StefanKarpinski