TSV over CSV #340
Comments
why are we not funding this!? :D Seems like a much better separator! |
I agree we should either provide "CSV2" variants or (probably better) detect the separator automatically. FWIW, data.table's
|
Any update on automatically inferring the |
CSVFiles.jl automatically uses |
I think it might be meaningful to also infer it automatically as suggested by @nalimilan above, but I think the heuristic is potentially problematic. IIUC this would split
into
instead of
|
The detection rule I proposed would only apply to distinguish CSV files from CSV2 files, i.e. to choose whether the separator is |
OK, makes sense. Would detecting |
If someone wanted to take a stab at the automatic delimiter detection, the place to "insert" it would be around Line 141 in db92782
parsinglayers. You have the io argument, and we could change delim=nothing by default. If delim === nothing, then we could check the name of the file (if provided) and try to guess the delimiter from that (".tsv" => '\t', etc.). We could then start parsing and take the approach that fread takes in R: read 5 lines, split them by comma, semicolon, space, etc., and the split that results in the same # of columns for the 5 lines is used as the delimiter. (Note that we'd want to use the readsplitline function already defined in the filedetection.jl file).
All in all, I don't think it'd be too hard to implement, so it would be a good "up for grabs" issue if someone wanted to get more familiar with the codebase a bit. I'm happy to review a PR and help answer any questions. |
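For readers who want to see the shape of the heuristic described above, here is a minimal sketch in Python rather than Julia. It is not the package's code: the names guess_delimiter, EXTENSION_DELIMS and CANDIDATES are made up for illustration, and the naive str.split ignores quoted fields, which a quote-aware helper like the readsplitline function mentioned above would presumably handle.

```python
import os

# Hypothetical names for illustration only; not part of any real package.
EXTENSION_DELIMS = {".tsv": "\t"}
CANDIDATES = [",", ";", "\t", " "]


def guess_delimiter(lines, filename=None, sample_size=5):
    """Guess a delimiter from the file name, else from the first few lines."""
    # Step 1: trust a telling file extension if one is given.
    if filename is not None:
        ext = os.path.splitext(filename)[1].lower()
        if ext in EXTENSION_DELIMS:
            return EXTENSION_DELIMS[ext]

    # Step 2: sample the first few non-blank lines.
    sample = [ln for ln in lines[:sample_size] if ln.strip()]

    best, best_cols = None, 1
    for delim in CANDIDATES:
        # Naive split: assumes no quoted fields contain the delimiter.
        counts = {len(ln.split(delim)) for ln in sample}
        # Keep a candidate only if every sampled line has the same
        # number of columns, and that number is greater than one.
        if len(counts) == 1:
            cols = counts.pop()
            if cols > best_cols:
                best, best_cols = delim, cols
    return best  # None when no candidate splits the sample consistently
```

With a semicolon-separated sample that uses decimal commas, such as ["a;b;c", "1,5;2,5;3,5"], splitting on commas gives a different column count per line, so the sketch falls through to the semicolon. Ties between consistent candidates are broken by taking the one with the most columns, which is a design choice of this sketch and not something the thread specifies.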
I think it might be best to have the 5-lines rules for |
@quinnj The tsv auto-detect is ready to be reviewed and merged. As for the few lines auto-detect for the |
With #365 merged, we now have autodetect for |
That's great! To my mind this does not close the issue, but it does make it a lot easier for users to work around. |
So what exactly is the ask here? |
That tab separated be the default |
The default now is auto-detect, but tab is the default for |
Hm that sounds like the issue is actually now closable :-) |
I don't think we need to be in the business of appending file extensions when writing. Closing. |
My comment is relating to the fact that if people saved their data with tab-delimited and named as such (e.g., |
Thanks again guys 💖 |
In Julia, and in particular this package, comma-separated values files are treated as the "default" file format. This reflects a convention that is not international, but is restricted to some countries.
At the heart of it, the comma is the worst possible column separator in countries where the decimal separator is also a comma. Here's a world map of those countries (light green is comma, blue [former British colonies + China] is dot, red and dark green something else):
This makes it either impossible to have data with floating point values (not really an option), or forces everyone to adopt the English dot decimal separator. While good things can be said for converging on a standard (like we all speak English on GitHub), it's not as simple as that. For apps like Excel, the way to force the English dot separator is to set your computer's global locale settings to UK/US, which is first of all very inconvenient, and second, most people aren't going to do it. Indeed, in Excel, if you open a CSV file on a non-English locale and save it back to CSV again, Excel will automatically save it using semicolons as column separators! That is of course Excel's problem, but it illustrates that this is not easily resolved by arguing that everyone must adopt the English standard.
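To make the clash concrete, here is a small Python sketch (not tied to any particular package). It assumes the writer does no quoting; the variable names and the helper parse_decimal_comma are made up for illustration.

```python
# Two values written with a decimal comma, as in many non-English locales.
row = ["1,5", "2,25"]

# Joined without quoting, the comma-separated line is ambiguous.
comma_line = ",".join(row)
tab_line = "\t".join(row)

comma_fields = comma_line.split(",")  # four fragments, original values lost
tab_fields = tab_line.split("\t")     # two fields, exactly as written


def parse_decimal_comma(text):
    """Hypothetical helper: convert a decimal-comma string to a float."""
    return float(text.replace(",", "."))


values = [parse_decimal_comma(field) for field in tab_fields]
```

A writer that quotes every field containing a comma avoids the ambiguity, at the cost of a noisier file.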
There's another issue with comma separation, and that is that it makes it impossible to have any natural language (such as notes) in data sets, because natural language often contains commas.
Fortunately there exists a standard, which is the standard format in countries with a comma decimal separator, that has none of the above problems, is international, and actually makes for much more human-readable text files: tab-separated values. I'd suggest adopting this standard wherever possible in Julia.
I think many people don't realize just how nice it is for all of us in non-English countries that Julia has adopted UTF-8 as the standard over ASCII. In every other language I've used, strings of natural language (in my case mostly geographic place names) are a constant headache, but not so in Julia. It really underlines that Julia is a language of the present and future in a globalized world. No, seriously.
My most wide-reaching suggestion would be to rename this package TSV instead of CSV. A more basic suggestion is to use tabs instead of commas as the default separator, so that you have to specify the separator explicitly if you want commas. This makes even more sense, as in half the world, csv means "semicolon-separated values", so it is natural to expect the user to choose which one. The very minimal suggestion would be to have CSV automatically use tab-separation in read and write when the file extension is .tsv (and possibly .txt) (CSVFiles does that, and DataFrames.readtable used to do that too).
R solves this issue by having read.table, read.delim, read.delim2, read.csv, read.csv2 etc. But I don't think that's an example to emulate.