Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fread(fill=TRUE) fills character fields with empty strings instead of NAs #2524

Open
st-pasha opened this issue Dec 14, 2017 · 9 comments
Open

Comments

@st-pasha
Copy link
Contributor

Test case:

> fread("A,B,C\n1,foo,bar\n2", fill=TRUE)
   A   B   C
1: 1 foo bar
2: 2        

Expected output:

> data.table(A=c(1L, 2L), B=c("foo", NA_character_), C=c("bar", NA_character_))
   A   B   C
1: 1 foo bar
2: 2  NA  NA
@mattdowle
Copy link
Member

Have been discussing with Pasha about this. He'd like filled character columns (fill=TRUE) to be filled with NA always, regardless of na.strings. Since there is no doubt they are missing. Or, at least controllable so user could choose that behavior. Whereas I see no difference between ,, in character columns, and a filled character columns and would like whether "" appears in na.strings or not, to affect the treatment of both ,, and filled character columns the same way.

Any views out there?

@MichaelChirico
Copy link
Member

I tend to agree with Pasha, but I don't understand this sentence:

Whereas I see no difference between ,, in character columns, and a filled character columns and would like whether "" appears in na.strings or not, to affect the treatment of both ,, and filled character columns the same way.

Would you mind paraphrasing?

@st-pasha
Copy link
Contributor Author

st-pasha commented Mar 6, 2018

Consider this file:

A,B,C
1,foo,bar
2
3,,baz

If na.strings="", then both Matt and me agree that the file should be understood as

> fread("A,B,C\n1,foo,bar\n2\n3,,baz", fill=TRUE, na.strings="")
   A   B   C
1: 1 foo bar
2: 2  NA  NA
3: 3  NA baz

However when na.strings is anything different, then we have different opinions on the matter:

> # Matt's opinion
> fread("A,B,C\n1,foo,bar\n2\n3,,baz", fill=TRUE, na.strings="?")
   A   B   C
1: 1 foo bar
2: 2        
3: 3     baz

> # my opinion
> fread("A,B,C\n1,foo,bar\n2\n3,,baz", fill=TRUE, na.strings="?")
   A   B   C
1: 1 foo bar
2: 2  NA  NA
3: 3     baz

@MichaelChirico
Copy link
Member

Interesting. I think I agree with Pasha.

To me it seems immediately clear that the second row should be NA, pretty much by definition.

The ambiguity comes from how to treat ,,; in the case when na.strings = '' it's clear that this should also be NA. Given that this is an option, the user, by not using '' in na.strings, is asserting that '' is not missing, and hence must be treated as ''.

Maybe we can add some output to verbose to signal this to the user when it happens, and they can adjust accordingly (e.g., add '' back to na.strings)

@HughParsonage
Copy link
Member

HughParsonage commented Mar 6, 2018

At the risk of adding complexity, but a way to avoid this decision would be to let fill accept values to replace when filling. fill and na.strings do appear to have slightly different purposes.

> fread("A,B,C,D\n1,foo,bar,5\n2\n3,,baz",
        fill=list(logical = NA, integer = 7L, character = ""),
        na.strings="?")
   A   B   C  D
1: 1 foo bar  5
2: 2          7
3: 3     baz  7

On the broader point, how important is it that fread and fwrite be inverses of each other?

@mattdowle
Copy link
Member

mattdowle commented Mar 6, 2018

Great points. I'm not so sure either on the importance of fread/fwrite being inverses of each other, by default. If saving from R and back into R needs to be preserving, wouldn't you use a binary format for that?
When I was doing analysis when importing data I preferred to collapse all things similar to missing, to missing. So I'd change ,[ ]+,, ,"",, ,, all to NA. Then I could use is.na consistently for all types including character, rather than have to remember that "" wasn't NA. In onwards analysis if there was a non-<NA> I could be sure it had at least one character, and that if that was "NA" then it was surely the string literal, such as the stock ticker. If that is why readr has chosen the default na.strings=c("","NA") with data analysis in mind, then I see why.

I like the fill= expansion idea; using the same argument fill= with a list is nice. However, if its default for filling character columns no longer comes from na.strings, that is a backwards incompatible default change that would not be covered by the getOption("datatable.na.string") plan. Can we get away with calling this a bug fix (as this issue is labelled) and just going ahead? Users would have to change their code to pass fill=list(character="") to get back to old behaviour.

In short, as long as by default :

fread("A,B,C\n1,foo,bar\n2\n3,,baz", fill=TRUE)
   A   B   C
1: 1 foo bar
2: 2  NA  NA
3: 3  NA baz

That's the main thing to me. Think everyone agrees on that. That's not the case now and needs PR #2652 to make that happen. Since Pasha has approved that one now, I'll request @MichaelChirico, @HughParsonage, @arunsrinivasan and @jangorecki to add your approvals too please since it's major one at the top of NEWS (but it doesn't actually change any default behaviour yet).

@pstoyanov
Copy link

FWIW, I would support Matt's view -- if the user has supplied explicit na.strings, they should be respected.

Reason: This is a broader issue, not just when the file has "". Some software (e.g. SAS/SPSS) support multiple types of missing values, and the user might be interested only in a subset of them for a concrete analysis (thereby supplying na.strings explicitly). Modifying the definition of 'missing value' behind the scenes would change user intent.

This also has an implication re fread and fwrite being inverses of each other -- this would be possible only within the domain of data structures supported by R. If the file originated in another system (e.g. with support of multiple missing values), that extra/unsupported info would be lost irrevocably when importing into R.

I like MichaelChirico's suggestion for notifying the user that ""s were not converted to NAs -- that would be useful for some users.

@ethanbsmith
Copy link
Contributor

it might be beneficial to explicitly state somewhere that fread and fwrite are not intended to support data-type stable round-tripping persistent storage. these functions are so so fast and convenient, that I fell into the trap of trying to do this without thinking about the suitability of this solution

@jangorecki
Copy link
Member

Unless of course schema is carried together, either as function arguments or CSVY header.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants