Various colClasses enhancements #2545
Codecov Report
@@ Coverage Diff @@
## master #2545 +/- ##
==========================================
+ Coverage 97.01% 97.03% +0.02%
==========================================
Files 66 66
Lines 12484 12559 +75
==========================================
+ Hits 12111 12187 +76
+ Misses 373 372 -1
|
|
Hi @HughParsonage. I've invited you to be project member; you'll need to accept the invite before the change will apply. This enables you to create branches in the main repository directly so we can all push to each other's branches. Welcome! |
| test(167, names(print(ggplot(DT,aes(b,f))+geom_point()))[c(1,3)], c("data","plot")) | ||
| DF <- data.frame( a=1:5, b=11:50, d=c("A","B","C","D"), f=1:5, grp=1:5 ) | ||
| res167 <- names(print(ggplot(DF,aes(b,f))+geom_point()))[c(1,3)] | ||
| test(167, names(print(ggplot(DT,aes(b,f))+geom_point()))[c(1,3)], res167) | ||
| # The names() is a stronger test that it has actually plotted, but also because test() sees the invisible result |
mattdowle
Jan 3, 2018
Member
What's the reason for this change please? The test no longer seems to test anything and repeats the test expression.
HughParsonage
Jan 4, 2018
Author
Member
I understood the test to mean that using data.table as opposed to data.frame does not affect ggplots. Note res167 uses data.frame.
On my computer: with ggplot2 v 2.2.1.9000,
names(print(ggplot(DT,aes(b,f))+geom_point()))[c(1,3)]
# [1] "data" "scales"
not c("data", "plot"). Indeed, none of the names are "plot". Would it be better to use all the names?
mattdowle
Jan 4, 2018
Member
Oh I see now. I didn't spot the difference DF vs DT in those two lines. ggplot2 changed its names then and if you upgrade to CRAN version ggplot2 v2.2.1 (released a year ago) it should work. Or is 2.2.1.9000 a more recent development version of ggplot2? If so, has it changed again?
mattdowle
Jan 4, 2018
Member
Yep, ggplot2 2.2.1.9000 is latest dev version you have. Good test. Ok they changed again and your solution looks good to me. Maybe just a comment alongside pointing out that DF is being used to find out what the names are.
HughParsonage
Jan 4, 2018
Author
Member
Yes, 2.2.1.9000 is the current dev version. Looking at the code for ggplot2, I can't tell what the expected behaviour ought to be:
ggplot.data.frame returns the list that I see: https://github.com/tidyverse/ggplot2/blob/7b5c185eff0dabef50e838701ab805ae09a1170f/R/plot.r#L92
ggplot_build.ggplot returns the list that the test originally expected. https://github.com/tidyverse/ggplot2/blob/152910a698121ac4fa771d110507e607d45cfc8f/R/plot-build.r#L99
Note that the test appears to proxy the actual appearance of the plot, and the plot does indeed occur on my machine.
| test(1445, fread("doublequote_newline.csv")[7:10], data.table(A=c(1L,1L,2L,1L), B=c("a","embedded \"\"field\"\"\nwith some embedded new\nlines as well","not this one","a"))) | ||
| } |
mattdowle
Jan 4, 2018
Member
This is passing successfully on AppVeyor Windows tests and on CRAN Windows tests. Something else must be going wrong on your Windows machine, to fix separately? Important to keep the test running on Windows.
HughParsonage
Jan 4, 2018
Author
Member
Would you prefer I raise this as an issue and revert this change, or relax the test to permit e.g. "embedded ""field""\r\nwith some embedded new\r\nlines as well" as well as "embedded ""field""\nwith some embedded new\nlines as well"?
mattdowle
Jan 4, 2018
Member
Yes: new issue please and revert the change. Why does it pass on AppVeyor and CRAN Windows but it doesn't pass on your Windows?
HughParsonage
Jan 4, 2018
Author
Member
Ahh, I think it's because when I clone the repo, \n is replaced with \r\n by git. My bad.
fread("https://raw.githubusercontent.com/Rdatatable/data.table/master/inst/tests/doublequote_newline.csv")[["B"]][8]
#> [1] "embedded \"\"field\"\"\nwith some embedded new\nlines as well"
fread("inst/tests/doublequote_newline.csv")[["B"]][8]
#> [1] "embedded \"\"field\"\"\r\nwith some embedded new\r\nlines as well"
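The mismatch above can be reproduced without the CSV file. A minimal sketch in base R, assuming (as the comment suggests) that the only difference between the two checkouts is LF versus CRLF line endings; `normalise_eol` is a hypothetical helper, not part of data.table:

```r
# The field as stored in the repository (LF line endings).
lf <- "embedded \"\"field\"\"\nwith some embedded new\nlines as well"

# The same field after a checkout that rewrites LF to CRLF.
crlf <- gsub("\n", "\r\n", lf, fixed = TRUE)

# Hypothetical helper: collapse CRLF back to LF before comparing.
normalise_eol <- function(x) gsub("\r\n", "\n", x, fixed = TRUE)

identical(lf, crlf)                 # FALSE: the embedded newlines differ
identical(normalise_eol(crlf), lf)  # TRUE once line endings are normalised
```

Relaxing the test this way would hide the checkout problem rather than fix it. Marking the test file as `-text` in `.gitattributes` is one way to stop git converting it, though whether the repository should do that is a separate question.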
Test failed locally because the test file changed `\n` to `\r\n` when cloned.
|
(Thinking out loud...) |
|
Do you want me to attempt a simpler version using the same approach (most of the complexity is just to construct the phrasing of the warning message), like
if (!all(colClasses %chin% c(NA, "logical", "integer", "numeric", "double", "character", "factor"))) {
  warning("colClasses contains unsupported values.")
}
(with corresponding tests modified), or do you think listing all the valid classes in |
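The simpler check proposed above could be wrapped up roughly as follows. This is a sketch only: `check_colClasses` is a hypothetical name, the set of supported classes is copied from the snippet, and `%in%` stands in for data.table's `%chin%` so that it runs in base R.

```r
# Assumed set of supported classes, copied from the snippet above.
supported <- c(NA, "logical", "integer", "numeric", "double", "character", "factor")

# Hypothetical helper: returns TRUE if every entry is supported,
# otherwise warns and returns FALSE.
check_colClasses <- function(colClasses) {
  ok <- colClasses %in% supported
  if (!all(ok)) {
    warning("colClasses contains unsupported values.", call. = FALSE)
  }
  all(ok)
}

check_colClasses(c("integer", NA, "character"))  # all supported, no warning
check_colClasses(c("integer", "Date"))           # warns about the unsupported entry
```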
|
My current thought is that the dispatch to the class method should be implemented rather than this stop-gap. I'm now thinking that would be quicker. I didn't realize that before. |
|
I thought it would be a case of just calling something like
So that's what
It seems you have to use
Maybe |
|
I was thinking something like
existsMethod("as.numeric")
#> [1] TRUE
existsMethod("as.Date")
#> [1] FALSE |
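One alternative to querying the methods package is to look the coercion function up by name. The sketch below assumes a convention of `as.<class>` functions with a fallback to `methods::as()`; `coerce_to` is a hypothetical helper and is not what the PR implements.

```r
# Hypothetical helper: find a function called as.<target> on the search path;
# if none exists, fall back to methods::as().
coerce_to <- function(v, target) {
  f <- get0(paste0("as.", target), mode = "function")
  if (is.null(f)) methods::as(v, target) else f(v)
}

coerce_to(c("1", "2"), "integer")  # dispatches to as.integer()
coerce_to("2019-05-02", "Date")    # dispatches to as.Date()
```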
|
awesome. minor comments:
|
|
Apologies for the enormous diff, but there were a few tricky paths, as evinced by the number of additional tests
I need to update the |
(Also fix spurious movement of })
|
@HughParsonage Small question: what do the |
|
Closest meaning would be 'partially resolved':
if select is used, colClasses need only correspond to the columns in select (with caveat)
Basically resolved: there are combinations of
Issue #1656 fread/fwrite base data types directly for efficiency
Not fully resolved: base data types can now be implemented within
In terms of why the latter isn't resolved, it's a combination of being too difficult for me and out-of-scope: this PR is meant to implement arbitrary |
…from C level to simplify R level. Not yet passing all tests; wip.
| }, | ||
| warning = fun <- function(e) { | ||
| etype = if (inherits(e,"error")) "error" else "warning" | ||
| warning(sprintf("Column '%s' was set by colClasses to be '%s' but fread encountered the following %s:\n\t%s\nso the column has been left as type '%s'", |
MichaelChirico
May 2, 2019
Member
Construct without sprintf (I don't think I have write access to Hugh's branch...)
mattdowle
May 2, 2019
Member
I don't mind either way, but it's ok with sprintf isn't it?
MichaelChirico
May 2, 2019
Member
I took the cue from WRE Section 1.7:
Try not to split up messages into small pieces. In C error messages use a single format string containing all English words in the messages.
In R error messages do not construct a message with paste (such messages will not be translated) but via multiple arguments to stop or warning, or via gettextf.
I already purged other instances of this construction elsewhere in the code, I can't find the PR right now...
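The two constructions WRE recommends can be sketched for the warning in the diff above. The function names and arguments here are hypothetical; only the message text comes from the PR.

```r
# Variant 1: a single format string passed through gettextf().
# domain = NA tells warning() not to translate the already-built message again.
warn_gettextf <- function(col, target, etype, msg, kept) {
  warning(gettextf("Column '%s' was set by colClasses to be '%s' but fread encountered the following %s:\n\t%s\nso the column has been left as type '%s'",
                   col, target, etype, msg, kept),
          call. = FALSE, domain = NA)
}

# Variant 2: multiple arguments to warning(), which concatenates them itself.
warn_multi <- function(col, target, etype, msg, kept) {
  warning("Column '", col, "' was set by colClasses to be '", target,
          "' but fread encountered the following ", etype, ":\n\t", msg,
          "\nso the column has been left as type '", kept, "'",
          call. = FALSE)
}
```

Both variants produce the same message text in an untranslated session; the difference only matters once message catalogues are involved.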
mattdowle
May 2, 2019
Member
As it's merged now, follow up PRs are good. I double-checked Hugh is a project member so he can create branches too. Good catch: I'd forgotten about that.
MichaelChirico
May 2, 2019
Member
yep, working on it 👍
| "complex" = as.complex(v), | ||
| "raw" = as_raw(v), # Internal implementation | ||
| "Date" = as.Date(v), | ||
| "POSIXct" = as.POSIXct(v), |
MichaelChirico
May 2, 2019
Member
Is there any way to build customized timezones into the API? Otherwise we should use tz = 'UTC' by default? Since the same string on different machines will parse differently otherwise...
mattdowle
May 2, 2019
Member
Good idea: could add tz= to fread I guess. But shouldn't default be whatever as.POSIXct() does by default (which is local time iirc) ?
MichaelChirico
May 2, 2019
Member
I'm basically arguing against R default behavior since I've always found it a bit murky to use inferred timezones --> fread behaves differently on different machines.
Adding tz to fread feels a bit strange, maybe accept tz as a "class" within colClasses a la
colClasses = list(POSIXct = 'time', tz = 'UTC')
Or (requires a bit more parsing within our code but I think feels pretty natural):
colClasses = list(POSIXct[UTC] = 'time')
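The concern about inferred time zones can be shown with base R alone. In this sketch the time string and the choice of "Australia/Adelaide" are illustrative assumptions, and it relies on the system's time zone database knowing that zone.

```r
s <- "2019-05-02 12:00:00"

# The same text parsed under two explicit time zones.
utc      <- as.POSIXct(s, tz = "UTC")
adelaide <- as.POSIXct(s, tz = "Australia/Adelaide")

# Same text, different instants: the gap is the zone's UTC offset on that date.
difftime(utc, adelaide, units = "hours")
```

With `tz` left unset, `as.POSIXct()` uses the session's time zone, so the result for a given file would depend on the machine reading it.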
mattdowle
May 2, 2019
Member
But eventually colClasses won't be needed to get a POSIXct (it would be built in as native type). Hence tz= being argument to fread. That way, tz= would apply to all datetime columns in the file without needing to specify them in colClasses.
MichaelChirico
May 2, 2019
Member
And I guess tz argument could have an API like colClasses to allow flexibly specifying multiple time zones:
tz = list(UTC = c('start', 'end'), 'Australia/South' = c('start_local', 'end_local'))
and
tz = c(start = 'UTC', end = 'UTC', start_local = 'Australia/South', end_local = 'Australia/South')
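The two proposed forms describe the same column-to-zone mapping, so one could be normalised into the other. A sketch, assuming a hypothetical helper `normalise_tz`; no `tz` argument with this shape exists in fread as of this discussion.

```r
# Hypothetical helper: turn the list form (zone -> columns) into the
# named-vector form (column -> zone); a named vector is returned unchanged.
normalise_tz <- function(tz) {
  if (!is.list(tz)) return(tz)
  cols  <- unlist(tz, use.names = FALSE)
  zones <- rep(names(tz), lengths(tz))
  stats::setNames(zones, cols)
}

by_zone <- list(UTC = c("start", "end"),
                "Australia/South" = c("start_local", "end_local"))
normalise_tz(by_zone)
```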
jangorecki
May 2, 2019
Member
I think it makes sense to use tz="UTC" and just document that, is there any follow up issue to not forget about this?
MichaelChirico
May 2, 2019
Member
my only hesitation is I'm not 100% sure we can do this before implementing the tz argument... e.g. some times only exist in certain time zones, so conversion may fail if we assume UTC but could read on user's machine otherwise?
jangorecki
May 2, 2019
Member
if it may fail then tryCatch, providing more complex structure to colClasses is good idea but should be added with care (ideally another PR), ideally aligning to API we will use in csvy specification. Using tz="UTC" seems to be simplest way to ensure some consistency.
| \itemize{ | ||
| \item{If coercion results in an error or introduces \code{NA}s, the attempt is aborted for that column with warning and the column's type is left unchanged (probably \code{character}).} | ||
| \item{Named list of vectors of column names or numbers are supported where the list names are the class names. The \code{list} form makes it easier to set a batch of columns to be a particular class; see examples. When column numbers are used in the `list` form, they refer to the column number in the file, not the column number after \code{select} or \code{drop} has been applied.} | ||
| \item{Columns are not demoted to a lower type if this would risk loss of information. You have to coerce such columns afterwards yourself, if you really require data loss.} |
MichaelChirico
May 2, 2019
Member
Points 1 & 3 have a lot of overlap, maybe just combine?
mattdowle
May 2, 2019
Member
1 is more about classes like POSIXct which take as input the character string.
3 is more about numeric data like 3.14 being specified as integer.
Maybe more examples and/or better wording?
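Point 1 (abort on error or on newly introduced NAs, leaving the column unchanged) might look like this in outline. `try_coerce` is a hypothetical base-R helper written for illustration; it is not the code in the PR.

```r
# Hypothetical helper: attempt as.<target>(v); if that errors, warns, or turns
# a non-missing value into NA, warn and return the column unchanged.
try_coerce <- function(v, target) {
  f <- get(paste0("as.", target), mode = "function")
  out <- tryCatch(f(v), error = function(e) NULL, warning = function(w) NULL)
  if (is.null(out) || any(is.na(out) & !is.na(v))) {
    warning("Column left as '", class(v)[1L], "': coercion to '", target,
            "' failed", call. = FALSE)
    return(v)
  }
  out
}

try_coerce(c("2019-05-02", "2019-05-03"), "Date")  # coerced to Date
try_coerce(c("2019-05-02", "oops"), "Date")        # NA introduced, left as character
```

This covers point 1 only. `as.integer("3.14")` truncates silently, so the no-demotion rule in point 3 would need a separate check.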
MichaelChirico
May 2, 2019
Member
Sounds good. At worst a tiny bit repetitive but maybe that's a good thing as it's pretty crucial.
| if (!is.null(names(colClasses))) { # names are column names; convert to list approach | ||
| if (!length(colClasses)) { | ||
| colClasses=NULL; | ||
| } else if (identical(colClasses, "NULL")) { |
MichaelChirico
May 2, 2019
Member
what's the use case for this? It's not mentioned in the manual...
mattdowle
May 2, 2019
Member
Not sure. Just saw it done in Hugh's branch and liked it. Maybe just to catch beginner errors? There is a warning that it is taken to mean NULL.
I can see that it breaks the case-1-not-different rule. So a 1-column file should have the column dropped (and null data.table returned) when colClasses="NULL"? For consistency. On the other hand beginners might well try "NULL".
mattdowle
May 2, 2019
Member
Maybe user has a loop through types and all columns should be coerced to that type. for (class in c("integer","double","NULL")) fread(...., colClasses=class). But I can't imagine a good reason to do that.
MichaelChirico
May 2, 2019
Member
Seems innocuous enough, just wondering if we should document it if there's a specific use case in mind... @HughParsonage any thoughts?
Closes #491 fread: colClasses does not covert to non-builtin types
Closes #2610 colClasses = POSIXct
Closes #1634 fread doesn't check colClasses to be valid type
Closes #2025 allow stringsAsFactors parameter to be a fraction between 0 and 1
(no issue number) When colClasses includes multiple list elements named factor (e.g. list(factor = 1, factor = 2)), all elements are respected, not just the first.
#1445 has been fixed when select is non-monotonic.
Not tackled here
#1656 fread/fwrite *base data types* directly for efficiency
#1426 if select is used, colClasses need only correspond to the columns in select; done in #3547
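The repeated-name case from the list above (`list(factor = 1, factor = 2)`) can be illustrated by merging list elements that share a class name. This is a base-R sketch of the idea only; `merge_colClasses` is a hypothetical helper, not how fread handles it internally.

```r
# Hypothetical helper: combine all elements that share a class name so that
# none of them is silently dropped.
merge_colClasses <- function(cc) {
  split(unlist(cc, use.names = FALSE), rep(names(cc), lengths(cc)))
}

cc <- list(factor = 1, factor = 2, character = 3)
merge_colClasses(cc)
```

Note that `cc[["factor"]]` on the original list returns only the first match, which is one way a second `factor` element could be overlooked.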