na.strings getOption() added so ,, can be read as NA by default in future #2652

mattdowle · 2018-03-02T23:16:29Z

Closes #2106 (again)
Closes #2217
Closes #2214
Closes #2281
Closes #1159

#2524 is related and in discussion as to whether filled character columns should have NA always independently of na.strings. Can address that separately to this PR.

Standardizing fread's default: ,, means NA for all types consistently (in particular in string columns), and ,"", means empty string, which is what fwrite writes by default (a change made in dev some months back).
See also the comment in the reopened issue: this PR reverts the fwrite change in dev for 1-column DTs back to the same consistent default as v1.10.4 on CRAN.

For all input data (i.e. all types, NA or "", 1 column or >1 column), fread(fwrite(DT)) == DT should be true without needing to change any arguments. This is not true before this PR.
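A minimal sketch of the round trip, assuming a data.table version in which fwrite quotes empty strings. The table `DT` and the temporary file are made up for illustration, and `na.strings=""` is passed explicitly so the result does not depend on which default is in force:

```r
library(data.table)

# One real string, one empty string and one missing value in a character column.
DT <- data.table(id = 1:3, s = c("a", "", NA_character_))

f <- tempfile(fileext = ".csv")
fwrite(DT, f)        # na="" by default: NA is written as an empty field
readLines(f)         # inspect how "" and NA were written

# Ask fread to treat an unquoted empty field as NA.
DT2 <- fread(f, na.strings = "")
identical(DT2$s, DT$s)
unlink(f)
```

Under the default this PR proposes, the `na.strings=` argument in the call above would not be needed.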

TODO:

~~allow quoted na.strings as per na.strings is too literal when column is quoted on file #2586 and change doc again.~~ Left open issue in this milestone to address separately to this PR. This PR is just about ,, -vs- ,"", default.
display NA in character columns as <NA> just like base R to distinguish from "" and "NA"

codecov-io · 2018-03-02T23:38:18Z

Codecov Report

Merging #2652 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2652      +/-   ##
==========================================
+ Coverage   93.31%   93.31%   +<.01%     
==========================================
  Files          61       61              
  Lines       12191    12196       +5     
==========================================
+ Hits        11376    11381       +5     
  Misses        815      815

Impacted Files	Coverage Δ
R/fread.R	`96.18% <ø> (ø)`	⬆️
R/fwrite.R	`100% <ø> (ø)`	⬆️
R/utils.R	`82.6% <ø> (ø)`	⬆️
src/fread.c	`97.98% <100%> (ø)`	⬆️
R/print.data.table.R	`98.13% <100%> (ø)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d8545e...1b035ac. Read the comment docs.

st-pasha

I like how fwrite no longer selects its na attribute based on the number of columns - makes the output more predictable.

However, as for fread now parsing ,, as an NA string, I have doubts. On one hand I appreciate how the treatment of ,, now becomes consistent across types (except booleans?), and it is also compatible with fwrite, which always writes empty strings quoted. On the other hand, other CSV writers do not follow such a convention: unless the user sets an option such as quoting="all", they will output an empty string as ,,. Most CSV writers don't even have the notion of an NA string.

The end result is that this change might be a breaking change for some users who have regular CSV files. If they have code that reads a file, obtains a character column, and then manipulates that column somehow, then having NAs where they used to have empty strings will likely lead to unexpected results.

I do not think such a change should be introduced without weighing all pros and cons, and without the usual deprecation cycle.

mattdowle · 2018-03-03T01:51:12Z

For booleans, both ,, and ,NA, are read as NA regardless of the na.strings= control, since write.csv writes logical as TRUE | FALSE | NA, unlike numeric columns where it writes NA as ,,.

Good points on breakage. Usual option() could be provided to return the old behaviour with notice at the top of NEWS in potential-breaking-changes section. The next version is quite a major change to fread (e.g. logical01too) so if it goes ahead, now seems like as good a time as any, for v1.11.0.
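A rough sketch of the option pattern in base R. The function `read_blank` is hypothetical and only mimics how a blank field would be interpreted; the option name `datatable.na.strings` is an assumption for illustration and should be checked against the NEWS entry:

```r
# Hypothetical stand-in for a reader: the default is looked up from an option
# each time the function is called, so users can switch behaviour globally.
read_blank <- function(na.strings = getOption("datatable.na.strings", "NA")) {
  # A blank field becomes NA only if "" is one of the NA strings.
  if ("" %in% na.strings) NA_character_ else ""
}

old <- options(datatable.na.strings = NULL)   # option unset: fallback "NA" is used
before <- read_blank()

options(datatable.na.strings = "")            # opt in to the new behaviour early
after <- read_blank()

options(old)                                  # restore whatever was set before
```

Because the default argument is evaluated at call time, changing the option affects later calls without touching any calling code.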

I'll sweep through all fread issues and see if any others are in this area.

More can be done to auto detect files which have used "NA" for NA in character columns. Maybe na.strings= could be __auto__ like skip= is now.

Views welcome from others and we'll keep this PR open a while. So far I've tried to fix the issues linked at the top.

mattdowle · 2018-03-03T03:57:10Z

readr's defaults seem to lean very much towards NA. It doesn't seem to mind about distinguishing. If instead it retained base R's blank string preference then maybe I would have thought again. But it's even more NA-leaning than this PR.

Here's fread's default result under this PR :

> fread(txt)
     A    B
1: 109   MT
2:   7    N
3:  11   NA
4:  41   NB
5:  60   ND
6:   1     
7:   2 <NA>
8:   3   NA
9:   4   NA

and read.csv's default result :

> cat(txt, file=f<-tempfile())
> read.csv(f)
    A    B
1 109   MT
2   7    N
3  11 <NA>
4  41   NB
5  60   ND
6   1     
7   2     
8   3 <NA>
9   4 <NA>

HughParsonage · 2018-03-03T03:59:34Z

There's read_csv(txt, quoted_na = T/F) though.

mattdowle · 2018-03-03T04:35:19Z

It's more about the choice of defaults. I'm finding the choice of quoted_na = TRUE by default to be odd, since quoting is primarily how most software distinguishes the literal string from the NA-meaning string. Same area as your #2586 but I see what you mean now on that one -- thanks.

MichaelChirico · 2018-03-07T03:13:47Z

NEWS.md

+    ```
+This option controls how `,,` is read in character columns. It does not affect numeric columns which read `,,` as `NA` regardless. We would like `,,`=>`NA` for consistency with numeric types, and `,"",`=>empty string to be the standard default for `fwrite/fread` character columns so that `fread(fwrite(DT))==DT` without needing any change to any parameters. `fwrite` has never written `NA` as `"NA"`, by default it already writes `,,`. The use of R's `getOption()` allows data.table users to move forward early, or restore old behaviour when the default's default is changed in future.
+
+2. `fread` now reads a column of all 0's and 1's as `logical` rather than `integer`, for convenience to avoid needing to change the type afterwards or use `colClasses`. The old behaviour can be restored with `options(datatable.logical01=FALSE)`. We felt this default change was ok to make because in all operations there should be no difference: R treats `logical` and `integer` the same. If this change does cause a problem, the option is provided to restore old behaviour while you update your code. Similarly, `fwrite` now writes `logical` columns as `0/1` by default, controlled by the same option. `0/1` is smaller and faster than `"TRUE"/"FALSE"`, which can make a significant difference to space and time the more `logical` columns there are. Further, a column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing styles (e.g. `TRUE/false`) is not and will be read as type `character`.


I would be a bit more careful on the wording:

"in all operations there should be no difference: R treats logical and integer the same."

But that's not true:

DT = data.table(l_int = c(0, 0, 1, 0), l_log = c(FALSE, FALSE, TRUE, FALSE), i = 1:4)
DT[(l_int)]
#    l_int l_log i
# 1:     0 FALSE 1
DT[(l_log)]
#    l_int l_log i
# 1:     1  TRUE 3

I think (?) more accurate is that all arithmetic expecting integer and getting logical will go through as expected (i.e., that sending 0/1 to FALSE/TRUE should be safe, whereas the reverse would cause more issues). Of course any function running an is.integer test will fail (and vice versa for is.logical on integer columns).
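The same distinction can be sketched in base R alone, without data.table. The vector names here are made up for illustration:

```r
flags_int <- c(0L, 0L, 1L, 0L)
flags_lgl <- c(FALSE, FALSE, TRUE, FALSE)

# Arithmetic coerces logical to integer, so these agree.
same_sum <- sum(flags_int) == sum(flags_lgl)

# Type tests do not agree.
int_is_integer <- is.integer(flags_int)
lgl_is_integer <- is.integer(flags_lgl)

# Subsetting differs: integers are positions (0 is dropped), logicals are a mask.
x <- letters[1:4]
by_int <- x[flags_int]
by_lgl <- x[flags_lgl]
```

Here `by_int` selects the first element and `by_lgl` the third, which mirrors the `DT[(l_int)]` and `DT[(l_log)]` results above.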

I had the same concern. I think as long as we add the option in 1.10.6 and change its default from 1.10.8, it will be fine.

MichaelChirico · 2018-03-07T03:17:41Z

NEWS.md

@@ -157,6 +167,8 @@ the behaviour of `base:::merge.data.frame()`. Thanks to @sritchie73 for reportin

 35. `CJ()` now fails with proper error message when results would exceed max integer, [#2636](https://github.com/Rdatatable/data.table/issues/2636).

+36. `NA` in character columns now display as `<NA>` just like base R to distinguish from `""` and `"NA"`.


this is nice, no need for quote = TRUE argument by default 👍
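A small sketch of what the new printing distinguishes, assuming a data.table version that includes this NEWS item. The column `s` is made up for illustration:

```r
library(data.table)

# Three values that are easy to confuse when printed without quotes:
# an empty string, the literal string "NA", and a missing value.
DT <- data.table(s = c("", "NA", NA_character_))

print(DT)              # the missing value should show as <NA>
printed <- capture.output(print(DT))

is.na(DT$s)            # only the third element is actually missing
```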

MichaelChirico · 2018-03-07T03:22:02Z

LGTM, don't see an option to approve the PR anywhere though 🤔

jangorecki · 2018-03-07T03:48:01Z

If we do 1/0 instead of TRUE/FALSE we could also do #1656, at least as an option, @mattdowle

jangorecki

Consistency of fwrite and fread should be the most important, then speed and options to customize.

HughParsonage · 2018-03-07T05:05:30Z

@MichaelChirico For me it was: from this page, select the Files changed tab, click the friendly green button Review, then the Approve radio button.

MichaelChirico · 2018-03-07T06:33:50Z

@HughParsonage thanks, I thought I remembered it on this (Conversation) tab

,, now read as NA not empty string

3057372

mattdowle added this to the v1.10.6 milestone Mar 2, 2018

mattdowle requested a review from st-pasha March 2, 2018 23:37

nocov on dev util

d3fc8ed

st-pasha reviewed Mar 3, 2018

mattdowle added 2 commits March 2, 2018 16:58

Added test for #2214

5f7b324

Added test for #2217

09cadbe

Updated ?fread to resolve #2586

d58f73c

print NA as <NA>

6fbc9c1

This was referenced Mar 3, 2018

fread/fwrite default options for NA mismatch #2281

Open

fread reads in empty fields as logical NA #1159

Closed

mattdowle added 3 commits March 5, 2018 15:52

na.strings now getOption with no default change yet

88674d2

Merge branch 'master' into na_blank

50c5cc6

New tests need na.strings= as default change is now postponed

bf04b2e

mattdowle changed the title ~~,, now read as NA not empty string~~ na.strings option added so ,, can be read as NA by default in future Mar 6, 2018

mattdowle changed the title ~~na.strings option added so ,, can be read as NA by default in future~~ na.strings getOption() added so ,, can be read as NA by default in future Mar 6, 2018

Breaking changes section added to NEWS. fread(logical01=getOption) too.

30d1c0b

st-pasha approved these changes Mar 6, 2018

mattdowle mentioned this pull request Mar 6, 2018

fread(fill=TRUE) fills character fields with empty strings instead of NAs #2524

Open

mattdowle requested review from arunsrinivasan, jangorecki, MichaelChirico and HughParsonage March 6, 2018 18:58

MichaelChirico reviewed Mar 7, 2018

jangorecki approved these changes Mar 7, 2018

HughParsonage approved these changes Mar 7, 2018

MichaelChirico approved these changes Mar 7, 2018

mattdowle added 4 commits March 7, 2018 18:34

Merge branch 'master' into na_blank

fd1e4ce

Merge branch 'master' into na_blank

e03095f

Merge branch 'master' into na_blank

94327ea

Merge branch 'master' into na_blank

2070b85

mattdowle mentioned this pull request Mar 16, 2018

Is there any reason for j to evaluate when i returns 0 rows? #2662

Closed

mattdowle added 4 commits March 16, 2018 23:31

Reverted logical01 to FALSE (old default, no change) thanks to review…

90d42f9

… comments and reflected at the top of NEWS.

Coverage

b679215

Link added to getOption 100x speedup submitted to R-core

4814608

New test covers what it was supposed to now.

2d2267c

arunsrinivasan approved these changes Mar 19, 2018

NEW item only. Added PR link and embellished wording.

1b035ac

mattdowle merged commit 29e2d46 into master Mar 20, 2018

mattdowle deleted the na_blank branch March 20, 2018 19:55
