Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

na.strings getOption() added so ,, can be read as NA by default in future #2652

Merged
merged 19 commits into from Mar 20, 2018

Conversation

@mattdowle
Copy link
Member

mattdowle commented Mar 2, 2018

Closes #2106 (again)
Closes #2217
Closes #2214
Closes #2281
Closes #1159

#2524 is related and in discussion as to whether filled character columns should have NA always independently of na.strings. Can address that separately to this PR.

Standardizing fread's default : ,, means NA for all types consistently (in particular in string columns). ,"", means empty string as written by fwrite by default (change made in dev some months back).
See also comment in reopened issue here that this PR reverts the fwrite change in dev for 1-column DTs back to the same consistent default in v1.10.4 as on CRAN.

For all input data (i.e. all types, NA or "", 1 column or >1 column), fread(fwrite(DT)) == DT should be true without needing to change any arguments. This is not true before this PR.

TODO:

  • allow quoted na.strings as per #2586 and change doc again. Left open issue in this milestone to address separately to this PR. This PR is just about ,, -vs- ,"", default.
  • display NA in character columns as <NA> just like base R to distinguish from "" and "NA"
@mattdowle mattdowle added this to the v1.10.6 milestone Mar 2, 2018
@mattdowle mattdowle requested a review from st-pasha Mar 2, 2018
@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Mar 2, 2018

Codecov Report

Merging #2652 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2652      +/-   ##
==========================================
+ Coverage   93.31%   93.31%   +<.01%     
==========================================
  Files          61       61              
  Lines       12191    12196       +5     
==========================================
+ Hits        11376    11381       +5     
  Misses        815      815
Impacted Files Coverage Δ
R/fread.R 96.18% <ø> (ø) ⬆️
R/fwrite.R 100% <ø> (ø) ⬆️
R/utils.R 82.6% <ø> (ø) ⬆️
src/fread.c 97.98% <100%> (ø) ⬆️
R/print.data.table.R 98.13% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d8545e...1b035ac. Read the comment docs.

Copy link
Contributor

st-pasha left a comment

I like how fwrite no longer selects its na attribute based on the number of columns - makes the output more predictable.

However as for fread now parsing ,, as an NA string, I have doubts. On one hand I appreciate how the treatment of ,, now becomes consistent across types (except booleans?), and it is also compatible with fwrite which writes empty strings as always quoted. On the other hand, all other CSV writers do not follow such a convention, and unless the user used option quoting="all", it will output empty string as ,,. Most CSV writers don't even have the notion of NA string.

The end result is that this change might be a breaking change for some users who have regular CSV files. If they have code that reads a file, obtains a character column, and then manipulates that column somehow, then having NAs where they used to have empty strings will likely lead to unexpected results.

I do not think such change should be introduced without weighing all pros and cons, and without the usual deprecation cycle.

mattdowle added 2 commits Mar 3, 2018
@mattdowle

This comment has been minimized.

Copy link
Member Author

mattdowle commented Mar 3, 2018

For booleans both ,, and ,NA, are read as NA, regardless of the na.strings= control. Since write.csv writes logical as TRUE | FALSE | NA unlike numeric columns where it writes NA as ,,.

Good points on breakage. Usual option() could be provided to return the old behaviour with notice at the top of NEWS in potential-breaking-changes section. The next version is quite a major change to fread (e.g. logical01too) so if it goes ahead, now seems like as good a time as any, for v1.11.0.

I'll sweep through all fread issues and see if any others are in this area.

More can be done to auto detect files which have used "NA" for NA in character columns. Maybe na.strings= could be __auto__ like skip= is now.

Views welcome from others and we'll keep this PR open a while. So far I've tried to fix the issues linked at the top.

@mattdowle

This comment has been minimized.

Copy link
Member Author

mattdowle commented Mar 3, 2018

readr's defaults seem to lean very much towards NA. It doesn't seem to mind about distinguishing. If instead it retained base R's blank string preference then maybe I would have thought again. But it's even more NA-leaning that this PR.
screenshot from 2018-03-02 20-16-11 resized

Here's fread's default result under this PR :

> fread(txt)
     A    B
1: 109   MT
2:   7    N
3:  11   NA
4:  41   NB
5:  60   ND
6:   1     
7:   2 <NA>
8:   3   NA
9:   4   NA

and read.csv's default result :

> cat(txt, file=f<-tempfile())
> read.csv(f)
    A    B
1 109   MT
2   7    N
3  11 <NA>
4  41   NB
5  60   ND
6   1     
7   2     
8   3 <NA>
9   4 <NA>
@HughParsonage

This comment has been minimized.

Copy link
Member

HughParsonage commented Mar 3, 2018

There's read_csv(txt, quoted_na = T/F) though.

@mattdowle

This comment has been minimized.

Copy link
Member Author

mattdowle commented Mar 3, 2018

It's more about choice of defaults. I'm finding the choice of quoted_na = TRUE by default to be odd, since quoted strings are primarily how most software distinguish the literal from the NA-meaning string. Same area as your #2586 but I see what you mean now on that one -- thanks.

@mattdowle mattdowle changed the title ,, now read as NA not empty string na.strings option added so ,, can be read as NA by default in future Mar 6, 2018
@mattdowle mattdowle changed the title na.strings option added so ,, can be read as NA by default in future na.strings getOption() added so ,, can be read as NA by default in future Mar 6, 2018
NEWS.md Outdated
```
This option controls how `,,` is read in character columns. It does not affect numeric columns which read `,,` as `NA` regardless. We would like `,,`=>`NA` for consistency with numeric types, and `,"",`=>empty string to be the standard default for `fwrite/fread` character columns so that `fread(fwrite(DT))==DT` without needing any change to any parameters. `fwrite` has never written `NA` as `"NA"`, by default it already writes `,,`. The use of R's `getOption()` allows data.table users to move forward early, or restore old behaviour when the default's default is changed in future.

2. `fread` now reads a column of all 0's and 1's as `logical` rather than `integer`, for convenience to avoid needing to change the type afterwards or use `colClasses`. The old behaviour can be restored with `options(datatable.logical01=FALSE)`. We felt this default change was ok to make because in all operations there should be no difference: R treats `logical` and `integer` the same. If this change does cause a problem, the option is provided to restore old behaviour while you update your code. Similarly, `fwrite` now writes `logical` columns as `0/1` by default, controlled by the same option. `0/1` is smaller and faster than `"TRUE"/"FALSE"`, which can make a significant difference to space and time the more `logical` columns there are. Further, a column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing styles (e.g. `TRUE/false`) is not and will be read as type `character`.

This comment has been minimized.

Copy link
@MichaelChirico

MichaelChirico Mar 7, 2018

Member

I would be a bit more careful on the wording:

in all operations there should be no difference: R treats logical and integer the same.

But that's not true:

DT = data.table(l_int = c(0, 0, 1, 0), l_log = c(FALSE, FALSE, TRUE, FALSE), i = 1:4)
DT[(l_int)]
#    l_int l_log i
# 1:     0 FALSE 1
DT[(l_log)]
#    l_int l_log i
# 1:     1  TRUE 3

I think (?) more accurate is that all arithmetic expecting integer and getting logical will go through as expected (i.e., that sending 0/1 to FALSE/TRUE should be safe, whereas the reverse would cause more issues). Of course any function running an is.integer test will fail (and vice versa for is.logical on integer columns).

This comment has been minimized.

Copy link
@jangorecki

jangorecki Mar 7, 2018

Member

I had the same concern. I think as long as we add option for 1.10.6 and change its default from 1.10.8 will be fine.

@@ -157,6 +167,8 @@ the behaviour of `base:::merge.data.frame()`. Thanks to @sritchie73 for reportin

35. `CJ()` now fails with proper error message when results would exceed max integer, [#2636](https://github.com/Rdatatable/data.table/issues/2636).

36. `NA` in character columns now display as `<NA>` just like base R to distinguish from `""` and `"NA"`.

This comment has been minimized.

Copy link
@MichaelChirico

MichaelChirico Mar 7, 2018

Member

this is nice, no need for quote = TRUE argument by default 👍

@MichaelChirico

This comment has been minimized.

Copy link
Member

MichaelChirico commented Mar 7, 2018

LGTM, don't see an option to approve the PR anywhere though 🤔

@jangorecki

This comment has been minimized.

Copy link
Member

jangorecki commented Mar 7, 2018

If we do 1/0 instead of TRUE/FALSE we could also make #1656, at least as option @mattdowle

Copy link
Member

jangorecki left a comment

consistency of fwrite and fread should be most important, then speed and options to customize.

@HughParsonage

This comment has been minimized.

Copy link
Member

HughParsonage commented Mar 7, 2018

@MichaelChirico For me it was: from this page, select the Files changed tab, click the friendly green button Review, then the Approve radio button.

@MichaelChirico

This comment has been minimized.

Copy link
Member

MichaelChirico commented Mar 7, 2018

@HughParsonage thanks, I thought I remembered it on this (Conversation) tab

mattdowle added 4 commits Mar 8, 2018
@MichaelChirico

This comment has been minimized.

Copy link
Member

MichaelChirico commented on R/onLoad.R in 4814608 Mar 17, 2018

good. 👍 btw any reason not to submit %chin% to base?

This comment has been minimized.

Copy link
Member Author

mattdowle replied Mar 17, 2018

In this getOption case simpler not to use %in% at all. But in general, reason is that %chin% uses the truelength-clobber trick to get its speed. The same trick is used in forder.c and that was accepted in base (a little to my surprise) in a localised way. As the years progress more people will discover that trick (like the Julia folks recently) and people in R-core. If no in-the-wild problems are reported (of R itself: ordering strings) then the trick could start to be used more widely, like by base::%in%. It would be a large patch. Good that %in% is used a lot in base and 10k packages as it would be well tested. %chin% was originally just for internal datatable use but as we got more confident in the trick (after years) the next step was exporting it. Yes next step could be base.

This comment has been minimized.

Copy link
Member

MichaelChirico replied Mar 19, 2018

Can't say I'm aware of the truelength-clobber trick? nothing in ?truelength

@mattdowle mattdowle merged commit 29e2d46 into master Mar 20, 2018
6 checks passed
6 checks passed
codecov/patch 100% of diff hit (target 93.31%)
Details
codecov/project 93.31% (+<.01%) compared to 4d8545e
Details
continuous-integration/appveyor/branch AppVeyor build succeeded
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
@mattdowle mattdowle deleted the na_blank branch Mar 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.