TSV over CSV #340

mkborregaard · 2018-10-24T06:30:38Z

In Julia, and in this package in particular, comma-separated values files are treated as the "default" file format. This reflects a culture where CSV is the default, but that culture is not international; it is restricted to some countries.

At the heart of it, the comma is the worst possible column separator in countries where the decimal separator is also a comma. Here's a world map of those countries (light green is comma, blue [former British colonies + China] is dot, red and dark green something else):

This makes it either impossible to have data with floating point values (not really an option), or forces everyone to adopt the English dot decimal separator. While good things can be said for converging on a standard (like we all speak English on GitHub), it's not as simple as that. For apps like Excel, the way to force the English dot separator is to set your computer's global locale settings to UK/US, which is first of all very inconvenient, and second, most people aren't going to do it. Indeed, in Excel, if you open a CSV file on a non-English locale and save it back to CSV again, Excel will automatically save it using semicolons as column separators! That is of course Excel's problem, but it illustrates that this is not easily resolved by arguing that everyone must adopt the English standard.

There's another issue with comma separation: it makes it hard to have any natural language (such as notes) in data sets, because natural language often contains commas, so every such field has to be quoted.

Fortunately there exists a standard, which is the standard format in countries with a comma decimal separator, that has none of the above problems, is international, and actually makes for much more human-readable text files: tab-separated values (TSV). I'd suggest adopting this standard wherever possible in Julia.

I think many people don't realize just how nice it is for all of us in non-English countries that Julia has adopted UTF-8 as the standard over ASCII - in any other language I've used, strings of natural language (in my case mostly geographic place names) are a constant headache, but not so in Julia. It really underlines that Julia is a language of the present and future in a globalized world. No, seriously.

My most wide-reaching suggestion would be to rename this package TSV instead of CSV. A more basic suggestion is to use tabs instead of commas as the default separator, so that you have to specify the separator explicitly if you want commas. This makes even more sense, as in half the world, csv means "semicolon-separated values", so it is natural to expect the user to choose which one. The very minimal suggestion would be to have CSV automatically use tab-separation in read and write when the file extension is .tsv (and possibly .txt) (CSVFiles does that, and DataFrames.readtable used to do that too).

R solves this issue by having read.table, read.delim, read.delim2, read.csv, read.csv2 etc. But - I don't think that's an example to emulate.


SimonDanisch · 2018-10-24T08:41:34Z

A more basic suggestion is to use tabs instead of commas as the default separator

why are we not funding this!? :D Seems like a much better separator!

nalimilan · 2018-10-26T17:09:39Z

I agree we should either provide "CSV2" variants or (probably better) detect the separator automatically. FWIW, data.table's fread function does that, so it's not that crazy:

     sep: The separator between columns. Defaults to the character in
          the set ‘[,\t |;:]’ that separates the sample of rows into
          the most number of lines with the same number of fields. Use
          ‘NULL’ or ‘""’ to specify no separator; i.e. each line a
          single character column like ‘base::readLines’ does.

Nosferican · 2018-11-22T14:39:50Z

Any update on automatically inferring the delim based on file extension for tsv? I have adopted tsv as the standard for the ecosystem I am working on, so it would be nice to drop the delim = '\t' for the tutorials. If it is straightforward, I can probably cook the PR.

davidanthoff · 2018-11-22T16:10:38Z

CSVFiles.jl automatically uses \t if the file extension is tsv, you could check that out if this is important to you.

mkborregaard · 2018-11-22T16:53:12Z

I think it might be meaningful to also infer it automatically as suggested by @nalimilan above, but I think the heuristic is potentially problematic. IIUC this would split

12,345\t23,231
35,231\t43,121

into

12    "345\t23"    231
35    "231\t43"    121

instead of

12.345    23.231
35.231    43.121
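The ambiguity in the example above can be checked with a small sketch. It is written in Python purely for illustration, is not CSV.jl or fread code, and `field_counts` is a hypothetical helper. It shows that both the comma and the tab split every row into a consistent number of fields, so a rule based only on consistent field counts needs some tie-breaker to choose between them.

```python
# Sample from the comment above: tab-separated columns, comma as decimal mark.
sample = ["12,345\t23,231", "35,231\t43,121"]


def field_counts(lines, delim):
    """Return the number of fields each line splits into for `delim`."""
    return [len(line.split(delim)) for line in lines]


# Both "," and "\t" give the same field count on every row,
# so consistency alone does not single out the tab.
for delim in [",", "\t", ";"]:
    print(repr(delim), field_counts(sample, delim))

# What each candidate split actually produces.
comma_rows = [line.split(",") for line in sample]
tab_rows = [line.split("\t") for line in sample]
```

If ties were broken in favour of the delimiter that yields more fields (an assumption here, not a claim about fread), the comma would win and the rows would be mangled as described.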

nalimilan · 2018-11-23T13:06:55Z

The detection rule I proposed would only apply to distinguish CSV files from CSV2 files, i.e. to choose whether the separator is , or ;. TSV files can be detected using the extension.

mkborregaard · 2018-11-23T13:07:57Z

OK, makes sense. Would detecting ; then default the decimal separator to ,? (that would make sense imho)
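A minimal sketch of what that coupling could look like, in Python for illustration only. `parse_row` is a hypothetical helper, not part of CSV.jl, and it assumes every field is numeric: when the delimiter is a semicolon, the comma is read as the decimal mark.

```python
def parse_row(line, delim):
    """Split a line and parse each field as a float.

    Assumption for this sketch: a ';' delimiter implies ',' as the
    decimal mark; any other delimiter implies '.'.
    """
    decimal = "," if delim == ";" else "."
    return [float(field.replace(decimal, ".")) for field in line.split(delim)]


print(parse_row("12,345;23,231", ";"))
print(parse_row("12.345,23.231", ","))
```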

quinnj · 2018-12-04T05:27:06Z

If someone wanted to take a stab at the automatic delimiter detection, the place to "insert" it would be around line 141 of CSV.jl/src/CSV.jl (at commit db92782):

kwargs = getkwargs(dateformat, decimal, getbools(truestrings, falsestrings))

i.e. before we construct the parsinglayers. You have the io argument, and we could change the default to delim=nothing. If delim === nothing, then we could check the name of the file (if provided) and try to guess the delimiter from that (".tsv" => '\t', etc.). We could then start parsing and take the approach that fread takes in R: read 5 lines, split them by comma, semicolon, space, etc., and use as the delimiter the split that results in the same number of columns for all 5 lines. (Note that we'd want to use the readsplitline function already defined in the filedetection.jl file.)
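The two-step approach described here could be sketched roughly as follows. This is Python for illustration, not the CSV.jl implementation: `guess_delimiter`, `EXTENSION_DELIMS` and `CANDIDATES` are hypothetical names, the candidate list is an assumption, and the naive `str.split` ignores quoted fields (which is what readsplitline would handle in CSV.jl).

```python
import os

# Only ".tsv" => tab is named in the thread; other entries would be assumptions.
EXTENSION_DELIMS = {".tsv": "\t"}
# Assumed candidate list, loosely following the fread documentation quoted earlier.
CANDIDATES = [",", "\t", " ", ";", "|", ":"]


def guess_delimiter(filename, sample_lines, default=","):
    """Guess a delimiter from the file extension, else from a 5-line sample."""
    ext = os.path.splitext(filename or "")[1].lower()
    if ext in EXTENSION_DELIMS:
        return EXTENSION_DELIMS[ext]

    # Use at most the first 5 lines and skip blank ones.
    lines = [line for line in sample_lines[:5] if line.strip()]
    best, best_cols = default, 1
    for delim in CANDIDATES:
        counts = {len(line.split(delim)) for line in lines}
        # Keep a delimiter only if every line has the same column count,
        # preferring the one that yields the most columns.
        if len(counts) == 1:
            ncols = counts.pop()
            if ncols > best_cols:
                best, best_cols = delim, ncols
    return best


print(repr(guess_delimiter("data.tsv", ["a,b,c"])))
print(repr(guess_delimiter("data.txt", ["a;b;c", "1,5;2;3"])))
```

Preferring the split with the most columns is one possible tie-breaker; as the earlier example with decimal commas shows, it can pick the wrong delimiter, which is why checking the extension first is useful.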

All in all, I don't think it'd be too hard to implement, so it would be a good "up for grabs" issue if someone wanted to get more familiar with the codebase a bit. I'm happy to review a PR and help answer any questions.

Nosferican · 2018-12-15T20:35:27Z

I think it might be best to have the 5-line rule for .txt files. .csv and .tsv should probably have the delimiter implied by the file extension.

Nosferican · 2018-12-18T18:29:52Z

@quinnj The tsv auto-detect is ready to be reviewed and merged. As for the few-lines auto-detect for the .txt files, could you give me a complete list of delimiters to check (e.g., comma, space, \t, ;, etc.)? I think those are the ones I can think of.

Nosferican · 2019-02-06T18:26:28Z

With #365 merged, we now have autodetect for tsv, csv, and wsv. txt files default to ,, but that could be modified once the auto-detect is implemented.

mkborregaard · 2019-02-06T21:20:47Z

That's great! To my mind this does not close the issue, but it does make it a lot easier for users to work around.

quinnj · 2019-04-26T21:30:29Z

So what exactly is the ask here?

mkborregaard · 2019-04-26T22:08:42Z

That tab separated be the default

Nosferican · 2019-04-26T22:10:09Z

The default now is auto-detect, but tab is the default for .tsv. Best practices would be to save tab-separated values as .tsv rather than .txt or .text.

mkborregaard · 2019-04-26T22:11:10Z

Hm that sounds like the issue is actually now closable :-)

quinnj · 2019-04-26T22:14:55Z

I don't think we need to be in the business of appending file extensions when writing. Closing.

Nosferican · 2019-04-26T22:17:31Z

My comment relates to the fact that if people save their data as tab-delimited and name it as such (e.g., filename.tsv), CSV.jl will now read it correctly by default, as opposed to before.

mkborregaard · 2019-04-26T22:24:39Z

Thanks again guys 💖

c42f mentioned this issue Oct 27, 2018

Documentation and examples JuliaData/TypedTables.jl#23

Open

Nosferican mentioned this issue Dec 15, 2018

Auto-detect delim for tsv #365

Merged

quinnj closed this as completed Apr 26, 2019
