fread's na.strings argument should handle values like "-999" #1314

Closed
arunsrinivasan opened this issue Sep 7, 2015 · 3 comments

@arunsrinivasan
Member

With @dselivanov's excellent PR, providing a na.strings value no longer results in columns being coerced to character.

There's still one point left to address though, as mentioned under that PR -- cases like these:

require(data.table)
DT = data.table(a=9:10, b=9:10 + 0.1, c=as.logical(0:1))
text = do.call("paste", c(DT, collapse="\n", sep=","))
ans1 = fread(text, na.strings=c("9", "9.1", "FALSE"))
#    V1   V2    V3
#1:  9  9.1 FALSE
#2: 10 10.1  TRUE

sapply(ans1, class)
#         V1        V2        V3 
#  "integer" "numeric" "logical" 

# whereas read.table() gives
ans2 = read.table(text=text, na.strings=c("9", "9.1", "FALSE"), sep=",", header=FALSE)
#   V1   V2   V3
#1 NA   NA   NA
#2 10 10.1 TRUE

sapply(ans2, class)
#         V1        V2        V3 
#  "integer" "numeric" "logical" 

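read.table() handles them correctly: the matching values become NA and the column types are preserved, whereas fread() leaves the values untouched.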
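Until fread() applies na.strings to non-character columns, one possible workaround is to replace the sentinel values after reading. The following is only a minimal base R sketch: `sentinels_to_na` is a hypothetical helper, not part of data.table, and it matches values through `as.character()`, so it only catches numbers whose printed form equals the given string.

```r
# Hypothetical helper (not part of data.table): set every cell whose
# character form matches one of `na.strings` to NA, keeping column types.
sentinels_to_na <- function(df, na.strings) {
  for (j in seq_along(df)) {
    hit <- as.character(df[[j]]) %in% na.strings
    if (any(hit)) df[[j]][hit] <- NA
  }
  df
}

# Same data as in the example above, built directly as a data.frame.
df  <- data.frame(a = 9:10, b = 9:10 + 0.1, c = as.logical(0:1))
out <- sentinels_to_na(df, c("9", "9.1", "FALSE"))

sapply(out, class)  # column classes are unchanged by the replacement
```

This costs an extra pass over the data after the read, so it is a stopgap rather than a substitute for handling na.strings inside the reader.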

@arunsrinivasan
Member Author

Benchmarks on the 0.1, 0.01, 0.001, 0.3 and na_rich scenarios after raising the if-statement.

#    File thisFR stable (6f99651)
#     0.1 51.421 48.068
#    0.01 51.776 54.264
#   0.001 52.255 50.820
#     0.3 48.150 47.930
# na_rich 17.570 17.354

@dselivanov

@arunsrinivasan, nice catch! Surprised by these timings!

@arunsrinivasan
Member Author

@dselivanov thanks. I wasn't clear about what the timings were. They are without na.strings=., to test the default scenario. I'll add benchmarks with the argument later. I'd guess there will be some performance hit, but I'm not sure yet how to avoid that while keeping the functionality.
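A comparison with and without the argument could be set up roughly as below. This is a sketch under stated assumptions: `make_csv` and `time_read` are hypothetical helpers, the sentinel -999 and the 10% fraction are illustrative, and the thread does not confirm that file labels such as 0.1 denote a proportion of NA values. The reader is passed in as a function so the same harness works for any reader.

```r
# Write a throwaway CSV in which a fraction `na_frac` of the values is the
# sentinel -999 (both the fraction and the sentinel are assumptions).
make_csv <- function(n, na_frac, path = tempfile(fileext = ".csv")) {
  x <- seq_len(n)
  x[sample.int(n, size = floor(n * na_frac))] <- -999L
  write.csv(data.frame(x = x), path, row.names = FALSE)
  path
}

# Elapsed seconds for a single call of reader(path, ...).
time_read <- function(reader, path, ...) {
  system.time(reader(path, ...))[["elapsed"]]
}

path      <- make_csv(1e5, 0.1)
t_default <- time_read(read.csv, path)                      # default scenario
t_na      <- time_read(read.csv, path, na.strings = "-999") # with the argument
```

To time fread instead, pass it as the reader, for example `time_read(data.table::fread, path, na.strings = "-999")`. A single run is noisy, so repeating each call several times and comparing medians would give a fairer picture.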
