TSV over CSV #340

Closed
mkborregaard opened this issue Oct 24, 2018 · 19 comments

Comments

@mkborregaard

mkborregaard commented Oct 24, 2018

In Julia, and in this package in particular, comma-separated values files are treated as the "default" file format. This reflects a culture where CSV is the default, but that culture is not international; it is restricted to some countries.

At the heart of it, the comma is the worst possible column separator in countries where the decimal separator is also a comma. Here's a world map of those countries (light green is comma, blue [former British colonies + China] is dot, red and dark green something else):

[Image: world map of decimal separators by country]

This either makes it impossible to have data with floating point values (not really an option) or forces everyone to adopt the English dot decimal separator. While good things can be said for just converging on a standard (like we all speak English on GitHub), it's not as simple as that. For apps like Excel, the way to force the English dot separator is to set your computer's global locale settings to UK/US, which is first of all very inconvenient, and second, most people aren't going to do it. Indeed, in Excel, if you open a CSV file on a non-English locale and save it back to CSV again, Excel will automatically save it using semicolons as column separators! That is of course Excel's problem, but it illustrates that this is not easily resolved by arguing that everyone must adopt the English standard.

There's another issue with comma separation: it makes it impossible to have any natural language (such as notes) in data sets, because natural language often contains commas.

Fortunately there exists a standard, which is the standard format in countries with a comma decimal separator, that has none of the above problems, is international, and actually makes for much more human-readable text files: tab-separated values. I'd suggest adopting this standard wherever possible in Julia.

I think many people don't realize just how nice it is for all of us in non-English countries that Julia has adopted UTF-8 as the standard over ASCII. In any other language I've used, strings of natural language (in my case mostly geographic place names) are a constant headache, but not so in Julia. It really underlines that Julia is a language of the present and future in a globalized world. No, seriously.

My most wide-reaching suggestion would be to rename this package TSV instead of CSV. A more basic suggestion is to use tabs instead of commas as the default separator, so that you have to specify the separator explicitly if you want commas. This makes even more sense, as in half the world, csv means "semicolon-separated values", so it is natural to expect the user to choose which one. The very minimal suggestion would be to have CSV automatically use tab-separation in read and write when the file extension is .tsv (and possibly .txt) (CSVFiles does that, and DataFrames.readtable used to do that too).

R solves this issue by having read.table, read.delim, read.delim2, read.csv, read.csv2 etc. But - I don't think that's an example to emulate.
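
On the point above about natural-language notes, here is a small sketch (in Python, purely for illustration; it is not CSV.jl code and the row is made-up data) of what a writer has to do with each separator. Writers that follow the usual CSV quoting convention can still store a note containing commas, at the cost of a quoted field, while tab-separated output leaves the note untouched.

```python
import csv
import io


def write_row(fields, delimiter):
    """Serialize one row with Python's csv writer and return it as a string."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerow(fields)
    return buf.getvalue()


# Made-up example row: a place name, a free-text note, and a measurement.
row = ["Lake Tahoe", "cold, clear water", "12.5"]

comma_line = write_row(row, ",")   # the note gets wrapped in double quotes
tab_line = write_row(row, "\t")    # the note is written as-is

print(repr(comma_line))
print(repr(tab_line))
```

A note that itself contains a tab would need quoting in the tab-separated case too, so this only shifts the problem to a character that is much rarer in prose.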

@SimonDanisch

A more basic suggestion is to use tabs instead of commas as the default separator

why are we not funding this!? :D Seems like a much better separator!

@nalimilan
Member

I agree we should either provide "CSV2" variants or (probably better) detect the separator automatically. FWIW, data.table's fread function does that, so it's not that crazy:

     sep: The separator between columns. Defaults to the character in
          the set ‘[,\t |;:]’ that separates the sample of rows into
          the most number of lines with the same number of fields. Use
          ‘NULL’ or ‘""’ to specify no separator; i.e. each line a
          single character column like ‘base::readLines’ does. 
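
A rough sketch of that kind of rule, written in Python for illustration only. It is an approximation of the quoted description, not fread's actual implementation: the function name `guess_separator`, the tie-breaking by candidate order, and the choice to ignore quoting are all assumptions made here.

```python
from collections import Counter

# Candidate set taken from the quoted documentation: [,\t |;:]
CANDIDATES = [",", "\t", " ", "|", ";", ":"]


def guess_separator(sample_lines, candidates=CANDIDATES):
    """Pick the candidate that gives the most lines sharing one field count.

    Splits that produce a single field are ignored, since they mean the
    candidate does not occur in the line. Ties go to the earlier candidate.
    Returns None when no candidate splits any line.
    """
    best_sep, best_score = None, 0
    for sep in candidates:
        counts = Counter(len(line.split(sep)) for line in sample_lines if line)
        counts.pop(1, None)  # a lone field means "separator not present"
        if not counts:
            continue
        score = max(counts.values())  # lines agreeing on the commonest width
        if score > best_score:
            best_sep, best_score = sep, score
    return best_sep


print(repr(guess_separator(["a;b;c", "1;2;3", "4;5;6"])))
```

Because the sketch splits naively, a quoted field that contains a candidate character would skew the counts; a real implementation would need to respect quoting.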

@Nosferican
Contributor

Nosferican commented Nov 22, 2018

Any update on automatically inferring the delim based on file extension for tsv? I have adopted tsv as the standard for the ecosystem I am working on, so it would be nice to drop the delim = '\t' from the tutorials. If it is straightforward, I can probably cook up the PR.

@davidanthoff

CSVFiles.jl automatically uses \t if the file extension is tsv, you could check that out if this is important to you.

@mkborregaard
Author

mkborregaard commented Nov 22, 2018

I think it might be meaningful to also infer it automatically as suggested by @nalimilan above, but I think the heuristic is potentially problematic. IIUC this would split

12,345\t23,231
35,231\t43,121

into

12    "345\t23"    231
35    "231\t43"    121

instead of

12.345    23.231
35.231    43.121
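
To make the concern concrete, here is a small Python sketch (illustrative only; `field_counts` is a helper invented for this example). It counts fields for the two sample lines above under both separators. Both candidates give a perfectly consistent number of fields per line, so a rule based only on consistency cannot tell them apart on this sample.

```python
# The two sample lines from above: decimal commas, tab-separated columns.
sample = ["12,345\t23,231", "35,231\t43,121"]


def field_counts(lines, sep):
    """Number of fields per line when splitting naively on sep."""
    return [len(line.split(sep)) for line in lines]


comma_counts = field_counts(sample, ",")   # three fields on every line
tab_counts = field_counts(sample, "\t")    # two fields on every line

# What the comma split actually produces for the first line.
comma_split_first = sample[0].split(",")

print(comma_counts, tab_counts, comma_split_first)
```

Which separator wins in practice then depends on tie-breaking details such as candidate order or a preference for more columns, which this sketch does not try to model.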

@nalimilan
Member

The detection rule I proposed would only apply to distinguish CSV files from CSV2 files, i.e. to choose whether the separator is , or ;. TSV files can be detected using the extension.

@mkborregaard
Author

mkborregaard commented Nov 23, 2018

OK, makes sense. Would detecting ; then default the decimal separator to ,? (that would make sense imho)
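
A minimal sketch of that idea in Python, for illustration only. The function `parse_row` and its `sep`/`decimal` parameters are hypothetical names, not CSV.jl's API: when the separator is a semicolon and no decimal mark is given, the decimal mark defaults to a comma; otherwise it defaults to a dot.

```python
def parse_row(line, sep=";", decimal=None):
    """Split a line on sep and parse every field as a float.

    If decimal is not given, assume ',' for semicolon-separated input
    and '.' for anything else (the defaulting rule asked about above).
    """
    if decimal is None:
        decimal = "," if sep == ";" else "."
    return [float(field.replace(decimal, ".")) for field in line.split(sep)]


print(parse_row("12,345;23,231"))           # semicolon file, decimal comma
print(parse_row("12.345,23.231", sep=","))  # comma file, decimal dot
```

An explicit `decimal` argument still overrides the default, so semicolon-separated files with dot decimals remain readable.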

@quinnj
Member

quinnj commented Dec 4, 2018

If someone wanted to take a stab at the automatic delimiter detection, the place to "insert" it would be around the line

kwargs = getkwargs(dateformat, decimal, getbools(truestrings, falsestrings))

i.e. before we construct the parsinglayers. You have the io argument, and we could change delim=nothing by default. If delim === nothing, then we could check the name of the file (if provided) and try to guess the delimiter from that (".tsv" => '\t', etc.). We could then start parsing and take the approach that fread takes in R: read 5 lines, split them by comma, semicolon, space, etc., and use as the delimiter the split that results in the same # of columns for all 5 lines. (Note that we'd want to use the readsplitline function already defined in the filedetection.jl file.)

All in all, I don't think it'd be too hard to implement, so it would be a good "up for grabs" issue if someone wanted to get more familiar with the codebase a bit. I'm happy to review a PR and help answer any questions.
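
The flow described above can be sketched as follows. This is a Python illustration of the steps only, not the CSV.jl implementation: `detect_delim`, the extension table, the candidate order, and the final fallback to a comma are all assumptions made for this example, and the naive `split` stands in for a quote-aware line splitter.

```python
import os
from itertools import islice

# Hypothetical extension table; only ".tsv" => tab is named in the comment above.
EXTENSION_DELIMS = {".tsv": "\t", ".csv": ","}
# Assumed candidate order for the sampling step.
FALLBACK_CANDIDATES = [",", ";", "\t", " "]


def detect_delim(path=None, lines=(), delim=None, nrows=5):
    """Return the delimiter to use for a file.

    1. An explicit delim always wins.
    2. Otherwise, a known file extension decides.
    3. Otherwise, sample up to nrows lines and take the first candidate
       that splits every sampled line into the same number (>1) of fields.
    4. If nothing matches, fall back to a comma.
    """
    if delim is not None:
        return delim
    if path is not None:
        ext = os.path.splitext(path)[1].lower()
        if ext in EXTENSION_DELIMS:
            return EXTENSION_DELIMS[ext]
    sample = [ln.rstrip("\r\n") for ln in islice(lines, nrows)]
    sample = [ln for ln in sample if ln]  # skip blank lines
    for cand in FALLBACK_CANDIDATES:
        widths = {len(ln.split(cand)) for ln in sample}
        if len(widths) == 1 and widths != {1}:
            return cand
    return ","


print(repr(detect_delim(path="data.tsv")))
print(repr(detect_delim(path="notes.txt", lines=["a;b", "1;2", "3;4"])))
```

Checking the extension before sampling keeps .tsv files with decimal commas from being misread as comma-separated, which is the case raised earlier in the thread.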

@Nosferican
Contributor

I think it might be best to apply the 5-line rule to .txt files. .csv and .tsv should probably have their delimiters implied by the file extension.

@Nosferican
Contributor

@quinnj The tsv auto-detect is ready to be reviewed and merged. As for the few-lines auto-detect for .txt files, could you give me a complete list of delimiters to check (e.g., ',', ' ', '\t', ';', etc.)? Those are the ones I can think of.

@Nosferican
Contributor

With #365 merged, we now have autodetect for tsv, csv, and wsv. txt files default to ',', but that could be modified once the auto-detect is implemented.

@mkborregaard
Author

That's great! To my mind this does not close the issue, but it does make it a lot easier for users to work around.

@quinnj
Member

quinnj commented Apr 26, 2019

So what exactly is the ask here?

@mkborregaard
Author

That tab separated be the default

@Nosferican
Contributor

The default now is auto-detect, but tab is the default for .tsv. Best practices would be to save tab-separated values as .tsv rather than .txt or .text.

@mkborregaard
Author

Hm that sounds like the issue is actually now closable :-)

@quinnj
Member

quinnj commented Apr 26, 2019

I don't think we need to be in the business of appending file extensions when writing. Closing.

@quinnj quinnj closed this as completed Apr 26, 2019
@Nosferican
Contributor

My comment relates to the fact that if people save their data as tab-delimited and name it as such (e.g., filename.tsv), CSV.jl will now read it correctly by default, as opposed to before.

@mkborregaard
Author

Thanks again guys 💖


6 participants