na.strings getOption() added so ,, can be read as NA by default in future #2652
Codecov Report
@@ Coverage Diff @@
## master #2652 +/- ##
==========================================
+ Coverage 93.31% 93.31% +<.01%
==========================================
Files 61 61
Lines 12191 12196 +5
==========================================
+ Hits 11376 11381 +5
Misses 815 815
I like how … However, as for … The end result is that this might be a breaking change for some users who have regular CSV files. If they have code that reads a file, obtains a character column, and then manipulates that column somehow, then having NAs where they used to have empty strings will likely lead to unexpected results. I do not think such a change should be introduced without weighing all pros and cons, and without the usual deprecation cycle.

For booleans both … Good points on breakage. Usual … I'll sweep through all fread issues and see if any others are in this area. More can be done to auto-detect files which have used … Views welcome from others and we'll keep this PR open a while. So far I've tried to fix the issues linked at the top.

There's …

It's more about choice of defaults. I'm finding the choice of …
This option controls how `,,` is read in character columns. It does not affect numeric columns which read `,,` as `NA` regardless. We would like `,,`=>`NA` for consistency with numeric types, and `,"",`=>empty string to be the standard default for `fwrite/fread` character columns so that `fread(fwrite(DT))==DT` without needing any change to any parameters. `fwrite` has never written `NA` as `"NA"`, by default it already writes `,,`. The use of R's `getOption()` allows data.table users to move forward early, or restore old behaviour when the default's default is changed in future.

2. `fread` now reads a column of all 0's and 1's as `logical` rather than `integer`, for convenience to avoid needing to change the type afterwards or use `colClasses`. The old behaviour can be restored with `options(datatable.logical01=FALSE)`. We felt this default change was ok to make because in all operations there should be no difference: R treats `logical` and `integer` the same. If this change does cause a problem, the option is provided to restore old behaviour while you update your code. Similarly, `fwrite` now writes `logical` columns as `0/1` by default, controlled by the same option. `0/1` is smaller and faster than `"TRUE"/"FALSE"`, which can make a significant difference to space and time the more `logical` columns there are. Further, a column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing styles (e.g. `TRUE/false`) is not and will be read as type `character`.
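A minimal sketch of how the two NEWS items above could be checked interactively. It assumes data.table 1.11.0 or later, where `fread` accepts the `na.strings` and `logical01` arguments. The inline CSV and its column names are made up for illustration, and the arguments are passed explicitly so the result does not depend on the session's option defaults.

```r
library(data.table)

# Hypothetical three-row input: an empty field (,,), a quoted empty
# field (,"",) and an ordinary value, plus a column of 0/1 flags.
csv <- 'id,s,flag\n1,,0\n2,"",1\n3,x,0\n'

# Pass both settings explicitly rather than relying on
# getOption("datatable.na.strings") / getOption("datatable.logical01").
DT <- fread(csv, na.strings = "", logical01 = TRUE)

is.na(DT$s)      # which rows of the character column were read as NA
DT$s == ""       # which rows were read as the empty string
class(DT$flag)   # type chosen for the column of 0s and 1s
```

Rerunning the same call with `na.strings = "NA"` or `logical01 = FALSE` is one way to see what each setting changes on a given installation.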
MichaelChirico (Member), Mar 7, 2018
I would be a bit more careful on the wording:
in all operations there should be no difference: R treats logical and integer the same.
But that's not true:
DT = data.table(l_int = c(0, 0, 1, 0), l_log = c(FALSE, FALSE, TRUE, FALSE), i = 1:4)
DT[(l_int)]
# l_int l_log i
# 1: 0 FALSE 1
DT[(l_log)]
# l_int l_log i
# 1: 1 TRUE 3
I think (?) more accurate is that all arithmetic expecting integer and getting logical will go through as expected (i.e., that sending 0/1 to FALSE/TRUE should be safe, whereas the reverse would cause more issues). Of course any function running an is.integer test will fail (and vice versa for is.logical on integer columns).
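The distinction drawn in this comment can be sketched in base R alone, without data.table. The vector names below are illustrative and not taken from the PR: arithmetic treats the two types alike, while type tests and subsetting do not.

```r
flag_lgl <- c(FALSE, FALSE, TRUE, FALSE)
flag_int <- c(0L, 0L, 1L, 0L)

# Arithmetic coerces logical to integer, so these agree.
sum(flag_lgl) == sum(flag_int)        # TRUE
identical(flag_lgl + 0L, flag_int)    # TRUE

# Type tests do not agree.
is.integer(flag_lgl)                  # FALSE
is.logical(flag_int)                  # FALSE

# Subsetting differs: a logical vector is a mask, an integer vector
# is a set of positions (zeros are dropped).
x <- c("a", "b", "c", "d")
x[flag_lgl]                           # "c"
x[flag_int]                           # "a"
```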
jangorecki (Member), Mar 7, 2018
I had the same concern. I think as long as we add the option in 1.10.6 and change its default from 1.10.8, it will be fine.
@@ -157,6 +167,8 @@ the behaviour of `base:::merge.data.frame()`. Thanks to @sritchie73 for reportin

35. `CJ()` now fails with proper error message when results would exceed max integer, [#2636](https://github.com/Rdatatable/data.table/issues/2636).

36. `NA` in character columns now displays as `<NA>` just like base R to distinguish from `""` and `"NA"`.
MichaelChirico (Member), Mar 7, 2018
this is nice, no need for quote = TRUE argument by default 👍

LGTM, don't see an option to approve the PR anywhere though

If we do 1/0 instead of TRUE/FALSE we could also make #1656, at least as option @mattdowle

consistency of fwrite and fread should be most important, then speed and options to customize.

@MichaelChirico For me it was: from this page, select the …

@HughParsonage thanks, I thought I remembered it on this (Conversation) tab
… comments and reflected at the top of NEWS.

good.

In this getOption case it's simpler not to use %in% at all. But in general, the reason is that %chin% uses the truelength-clobber trick to get its speed. The same trick is used in forder.c, and that was accepted in base (a little to my surprise) in a localised way. As the years progress, more people will discover that trick (like the Julia folks recently), including people in R-core. If no in-the-wild problems are reported (of R itself: ordering strings) then the trick could start to be used more widely, like by base::%in%. It would be a large patch. It's good that %in% is used a lot in base and 10k packages, as it would be well tested. %chin% was originally just for internal data.table use, but as we got more confident in the trick (after years) the next step was exporting it. Yes, the next step could be base.

Can't say I'm aware of the truelength-clobber trick? nothing in …

Closes #2106 (again)
Closes #2217
Closes #2214
Closes #2281
Closes #1159
#2524 is related and in discussion as to whether filled character columns should have NA always, independently of `na.strings`. Can address that separately to this PR.

Standardizing `fread`'s default:
- `,,` means `NA` for all types consistently (in particular in string columns).
- `,"",` means empty string, as written by `fwrite` by default (change made in dev some months back).

See also comment in reopened issue here that this PR reverts the `fwrite` change in dev for 1-column DTs back to the same consistent default in v1.10.4 as on CRAN.

For all input data (i.e. all types, NA or "", 1 column or >1 column), `fread(fwrite(DT)) == DT` should be true without needing to change any arguments. This is not true before this PR.

TODO:
- allow quoted na.strings as per #2586 and change doc again. Left open issue in this milestone to address separately to this PR. This PR is just about `,,`-vs-`,"",` default.
- `<NA>` just like base R to distinguish from `""` and `"NA"`
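
The round-trip goal stated in the description could be checked along the following lines. This is only a sketch, assuming a data.table version that includes this PR's behaviour; the table contents are made up, and `na.strings = ""` is passed explicitly because the option's default may differ between releases.

```r
library(data.table)

# Hypothetical table mixing NA, the empty string and a regular value.
DT <- data.table(id = 1:3, s = c(NA, "", "x"))

f <- tempfile(fileext = ".csv")
fwrite(DT, f)
written <- readLines(f)   # inspect how NA and "" were written to disk

# Read back, asking explicitly for ,, to be treated as NA.
DT2 <- fread(f, na.strings = "")
roundtrip <- all.equal(DT, DT2)   # TRUE if the round trip preserved DT

unlink(f)
```

If `roundtrip` is not `TRUE`, comparing `written` against the rules above (`,,` for `NA`, `,"",` for the empty string) shows whether the write or the read side differs.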