Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

na.strings getOption() added so ,, can be read as NA by default in future #2652

Merged
merged 19 commits into from
Mar 20, 2018

Conversation

mattdowle
Copy link
Member

@mattdowle mattdowle commented Mar 2, 2018

Closes #2106 (again)
Closes #2217
Closes #2214
Closes #2281
Closes #1159

#2524 is related and in discussion as to whether filled character columns should have NA always independently of na.strings. Can address that separately to this PR.

Standardizing fread's default : ,, means NA for all types consistently (in particular in string columns). ,"", means empty string as written by fwrite by default (change made in dev some months back).
See also comment in reopened issue here that this PR reverts the fwrite change in dev for 1-column DTs back to the same consistent default in v1.10.4 as on CRAN.

For all input data (i.e. all types, NA or "", 1 column or >1 column), fread(fwrite(DT)) == DT should be true without needing to change any arguments. This is not true before this PR.

TODO:

  • allow quoted na.strings as per na.strings is too literal when column is quoted on file #2586 and change doc again. Left open issue in this milestone to address separately to this PR. This PR is just about ,, -vs- ,"", default.
  • display NA in character columns as <NA> just like base R to distinguish from "" and "NA"

@mattdowle mattdowle added this to the v1.10.6 milestone Mar 2, 2018
@mattdowle mattdowle requested a review from st-pasha March 2, 2018 23:37
@codecov-io
Copy link

codecov-io commented Mar 2, 2018

Codecov Report

Merging #2652 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2652      +/-   ##
==========================================
+ Coverage   93.31%   93.31%   +<.01%     
==========================================
  Files          61       61              
  Lines       12191    12196       +5     
==========================================
+ Hits        11376    11381       +5     
  Misses        815      815
Impacted Files Coverage Δ
R/fread.R 96.18% <ø> (ø) ⬆️
R/fwrite.R 100% <ø> (ø) ⬆️
R/utils.R 82.6% <ø> (ø) ⬆️
src/fread.c 97.98% <100%> (ø) ⬆️
R/print.data.table.R 98.13% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d8545e...1b035ac. Read the comment docs.

Copy link
Contributor

@st-pasha st-pasha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how fwrite no longer selects its na attribute based on the number of columns - makes the output more predictable.

However as for fread now parsing ,, as an NA string, I have doubts. On one hand I appreciate how the treatment of ,, now becomes consistent across types (except booleans?), and it is also compatible with fwrite which writes empty strings as always quoted. On the other hand, all other CSV writers do not follow such a convention, and unless the user used option quoting="all", it will output empty string as ,,. Most CSV writers don't even have the notion of NA string.

The end result is that this change might be a breaking change for some users who have regular CSV files. If they have code that reads a file, obtains a character column, and then manipulates that column somehow, then having NAs where they used to have empty strings will likely lead to unexpected results.

I do not think such change should be introduced without weighing all pros and cons, and without the usual deprecation cycle.

@mattdowle
Copy link
Member Author

mattdowle commented Mar 3, 2018

For booleans both ,, and ,NA, are read as NA, regardless of the na.strings= control. Since write.csv writes logical as TRUE | FALSE | NA unlike numeric columns where it writes NA as ,,.

Good points on breakage. Usual option() could be provided to return the old behaviour with notice at the top of NEWS in potential-breaking-changes section. The next version is quite a major change to fread (e.g. logical01too) so if it goes ahead, now seems like as good a time as any, for v1.11.0.

I'll sweep through all fread issues and see if any others are in this area.

More can be done to auto detect files which have used "NA" for NA in character columns. Maybe na.strings= could be __auto__ like skip= is now.

Views welcome from others and we'll keep this PR open a while. So far I've tried to fix the issues linked at the top.

@mattdowle
Copy link
Member Author

mattdowle commented Mar 3, 2018

readr's defaults seem to lean very much towards NA. It doesn't seem to mind about distinguishing. If instead it retained base R's blank string preference then maybe I would have thought again. But it's even more NA-leaning that this PR.
screenshot from 2018-03-02 20-16-11 resized

Here's fread's default result under this PR :

> fread(txt)
     A    B
1: 109   MT
2:   7    N
3:  11   NA
4:  41   NB
5:  60   ND
6:   1     
7:   2 <NA>
8:   3   NA
9:   4   NA

and read.csv's default result :

> cat(txt, file=f<-tempfile())
> read.csv(f)
    A    B
1 109   MT
2   7    N
3  11 <NA>
4  41   NB
5  60   ND
6   1     
7   2     
8   3 <NA>
9   4 <NA>

@HughParsonage
Copy link
Member

There's read_csv(txt, quoted_na = T/F) though.

@mattdowle
Copy link
Member Author

mattdowle commented Mar 3, 2018

It's more about choice of defaults. I'm finding the choice of quoted_na = TRUE by default to be odd, since quoted strings are primarily how most software distinguish the literal from the NA-meaning string. Same area as your #2586 but I see what you mean now on that one -- thanks.

@mattdowle mattdowle changed the title ,, now read as NA not empty string na.strings option added so ,, can be read as NA by default in future Mar 6, 2018
@mattdowle mattdowle changed the title na.strings option added so ,, can be read as NA by default in future na.strings getOption() added so ,, can be read as NA by default in future Mar 6, 2018
NEWS.md Outdated
```
This option controls how `,,` is read in character columns. It does not affect numeric columns which read `,,` as `NA` regardless. We would like `,,`=>`NA` for consistency with numeric types, and `,"",`=>empty string to be the standard default for `fwrite/fread` character columns so that `fread(fwrite(DT))==DT` without needing any change to any parameters. `fwrite` has never written `NA` as `"NA"`, by default it already writes `,,`. The use of R's `getOption()` allows data.table users to move forward early, or restore old behaviour when the default's default is changed in future.

2. `fread` now reads a column of all 0's and 1's as `logical` rather than `integer`, for convenience to avoid needing to change the type afterwards or use `colClasses`. The old behaviour can be restored with `options(datatable.logical01=FALSE)`. We felt this default change was ok to make because in all operations there should be no difference: R treats `logical` and `integer` the same. If this change does cause a problem, the option is provided to restore old behaviour while you update your code. Similarly, `fwrite` now writes `logical` columns as `0/1` by default, controlled by the same option. `0/1` is smaller and faster than `"TRUE"/"FALSE"`, which can make a significant difference to space and time the more `logical` columns there are. Further, a column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing styles (e.g. `TRUE/false`) is not and will be read as type `character`.
Copy link
Member

@MichaelChirico MichaelChirico Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be a bit more careful on the wording:

in all operations there should be no difference: R treats logical and integer the same.

But that's not true:

DT = data.table(l_int = c(0, 0, 1, 0), l_log = c(FALSE, FALSE, TRUE, FALSE), i = 1:4)
DT[(l_int)]
#    l_int l_log i
# 1:     0 FALSE 1
DT[(l_log)]
#    l_int l_log i
# 1:     1  TRUE 3

I think (?) more accurate is that all arithmetic expecting integer and getting logical will go through as expected (i.e., that sending 0/1 to FALSE/TRUE should be safe, whereas the reverse would cause more issues). Of course any function running an is.integer test will fail (and vice versa for is.logical on integer columns).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same concern. I think as long as we add option for 1.10.6 and change its default from 1.10.8 will be fine.

@@ -157,6 +167,8 @@ the behaviour of `base:::merge.data.frame()`. Thanks to @sritchie73 for reportin

35. `CJ()` now fails with proper error message when results would exceed max integer, [#2636](https://github.com/Rdatatable/data.table/issues/2636).

36. `NA` in character columns now display as `<NA>` just like base R to distinguish from `""` and `"NA"`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is nice, no need for quote = TRUE argument by default 👍

@MichaelChirico
Copy link
Member

LGTM, don't see an option to approve the PR anywhere though 🤔

@jangorecki
Copy link
Member

jangorecki commented Mar 7, 2018

If we do 1/0 instead of TRUE/FALSE we could also make #1656, at least as option @mattdowle

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consistency of fwrite and fread should be most important, then speed and options to customize.

@HughParsonage
Copy link
Member

@MichaelChirico For me it was: from this page, select the Files changed tab, click the friendly green button Review, then the Approve radio button.

@MichaelChirico
Copy link
Member

@HughParsonage thanks, I thought I remembered it on this (Conversation) tab

@mattdowle mattdowle merged commit 29e2d46 into master Mar 20, 2018
@mattdowle mattdowle deleted the na_blank branch March 20, 2018 19:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants