Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various colClasses enhancements #2545

Merged
merged 51 commits into from May 2, 2019
Merged

Conversation

@HughParsonage
Copy link
Member

@HughParsonage HughParsonage commented Jan 3, 2018

Closes #491 fread: colClasses does not covert to non-builtin types
Closes #2610 colClasses = POSIXct
Closes #1634 fread doesn't check colClasses to be valid type
Closes #2025 allow stringsAsFactors parameter to be a fraction between 0 and 1
(no issue number) When colClasses includes multiple list elements named factor (e.g. list(factor = 1, factor = 2)), all elements are respected, not just the first.
#1445 has been fixed when select is non-monotonic.

Not tackled here
#1656 fread/fwrite *base data types* directly for efficiency
#1426 if select is used, colClasses need only correspond to the columns in select; done in #3547

@codecov-io
Copy link

@codecov-io codecov-io commented Jan 3, 2018

Codecov Report

Merging #2545 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2545      +/-   ##
==========================================
+ Coverage   97.01%   97.03%   +0.02%     
==========================================
  Files          66       66              
  Lines       12484    12559      +75     
==========================================
+ Hits        12111    12187      +76     
+ Misses        373      372       -1
Impacted Files Coverage Δ
src/freadR.c 96.54% <100%> (+0.25%) ⬆️
R/fread.R 99.4% <100%> (+0.84%) ⬆️
src/init.c 100% <100%> (ø) ⬆️
R/data.table.R 97.65% <100%> (ø) ⬆️
src/fread.c 98.5% <100%> (ø) ⬆️
src/rbindlist.c 100% <0%> (ø) ⬆️
R/frank.R 100% <0%> (ø) ⬆️
R/setops.R 98.96% <0%> (ø) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 78acb70...8ef1402. Read the comment docs.

@mattdowle
Copy link
Member

@mattdowle mattdowle commented Jan 3, 2018

Hi @HughParsonage. I've invited you to be project member; you'll need to accept the invite before the change will apply. This enables you to create branches in the main repository directly so we can all push to each other's branches. Welcome!

test(167, names(print(ggplot(DT,aes(b,f))+geom_point()))[c(1,3)], c("data","plot"))
DF <- data.frame( a=1:5, b=11:50, d=c("A","B","C","D"), f=1:5, grp=1:5 )
res167 <- names(print(ggplot(DF,aes(b,f))+geom_point()))[c(1,3)]
test(167, names(print(ggplot(DT,aes(b,f))+geom_point()))[c(1,3)], res167)
# The names() is a stronger test that it has actually plotted, but also because test() sees the invisible result
Copy link
Member

@mattdowle mattdowle Jan 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for this change please? The test no longer seems to test anything and repeats the test expression.

Copy link
Member

@mattdowle mattdowle Jan 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible you hit this unrelated issue with the same test 167 : #2546

Copy link
Member Author

@HughParsonage HughParsonage Jan 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood the test to mean that using data.table as opposed to data.frame does not affect ggplots. Note res167 uses data.frame.

On my computer: with ggplot2 v 2.2.1.9000,

names(print(ggplot(DT,aes(b,f))+geom_point()))[c(1,3)]
# [1] "data" "scales"

not c("data", "plot"). Indeed, none of the names are "plot". Would it be better to use all the names?

Copy link
Member

@mattdowle mattdowle Jan 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see now. I didn't spot the difference DF vs DT in those two lines. ggplot2 changed its names then and if you upgrade to CRAN version ggplot2 v2.2.1 (released a year ago) it should work. Or, is 2.2.1.9000 a more recent development version of ggplot2. If so, has it changed again?

Copy link
Member

@mattdowle mattdowle Jan 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, ggplot2 2.2.1.9000 is latest dev version you have. Good test. Ok they changed again and your solution looks good to me. Maybe just a comment alongside pointing out that DF is being used to find out what the names are.

Copy link
Member Author

@HughParsonage HughParsonage Jan 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 2.2.1.9000 is the current dev version. Looking at the code for ggplot2, I can't tell what the expected behaviour ought to be:

ggplot.data.frame returns the list that I see: https://github.com/tidyverse/ggplot2/blob/7b5c185eff0dabef50e838701ab805ae09a1170f/R/plot.r#L92

ggplot_build.ggplot returns the list that the test originally expected. https://github.com/tidyverse/ggplot2/blob/152910a698121ac4fa771d110507e607d45cfc8f/R/plot-build.r#L99

Note that the test appears to proxy the actual appearance of the plot, and the plot does indeed occur on my machine.

test(1445, fread("doublequote_newline.csv")[7:10], data.table(A=c(1L,1L,2L,1L), B=c("a","embedded \"\"field\"\"\nwith some embedded new\nlines as well","not this one","a")))
}
Copy link
Member

@mattdowle mattdowle Jan 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is passing successfully on AppVeyor Windows tests and on CRAN Windows tests. Something else must be going wrong on your Windows machine, to fix separately? Important to keep the test running on Windows.

Copy link
Member Author

@HughParsonage HughParsonage Jan 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you prefer I raise this an issue and revert this change, or relax the test to permit e.g. embedded ""field""\r\nwith some embedded new\r\nlines as well as well as embedded ""field""\nwith some embedded new\nlines as well?

Copy link
Member

@mattdowle mattdowle Jan 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes : new issue please and revert the change. Why does it pass on AppVeyor and CRAN WIndows but it doesn't pass on your Windows?

Copy link
Member Author

@HughParsonage HughParsonage Jan 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I think it's because when I clone the repo, \n is replaced with \r\n by git. My bad.

fread("https://raw.githubusercontent.com/Rdatatable/data.table/master/inst/tests/doublequote_newline.csv")[["B"]][8]
#> [1] "embedded \"\"field\"\"\nwith some embedded new\nlines as well"

fread("inst/tests/doublequote_newline.csv")[["B"]][8]
#> [1] "embedded \"\"field\"\"\r\nwith some embedded new\r\nlines as well"

Copy link
Member

@mattdowle mattdowle left a comment

(Thinking out loud...)
The end result is great but going too far I feel. check_colClasses_validity() has quite a lot of lines to maintain relative to what it does. When I filed the issue, I had in mind just a quick 2 or 3 extra lines in fread! In freadR.c on line 32 are the R names that are matched to colClasses supplied by user. When there's no match, the idea was at some point to read data as character and then call the class method afterwards at R level, like read.csv does. If there is no class method (e.g. no as.foo exists at R level, not even defined by user), that's when it should fail, for that reason (standard R error that as.foo() does not exist). There's 'CLASS' at C level in freadR.c with that in mind but I don't recall how far we went implementing that. Listing the valid and invalid types explicitly in check_colClasses_validity isn't in the spirit of S3 dispatch. Maybe all we need is the class method dispatched to call, say, as.Date in the meantime until we implement directly at C level. And when we do implement directly at C level, there will be no change for the user other than experiencing a speedup.

@HughParsonage
Copy link
Member Author

@HughParsonage HughParsonage commented Jan 4, 2018

Do you want me to attempt a simpler version using the same approach (most of the complexity is just to construct the phrasing of the warning message), like

    if (!all(colClasses %chin% c(NA, "logical", "integer", "numeric", "double", "character", "factor"))) {
      warning("colClasses contains unsupported values.")
    }

(with corresponding tests modified).

or do you think listing all the valid classes in fread.R is simply the wrong approach at this time? Obviously such an approach would still not be in the spirit of S3 dispatch, but this PR is merely intended as a stop-gap for the issue while colClasses = "Date" etc are not yet implemented.

@mattdowle
Copy link
Member

@mattdowle mattdowle commented Jan 4, 2018

My current thought is that the dispatch to the class method should be implemented rather than this stop-gap. I'm now thinking that would be quicker. I didn't realize that before.
To confirm the new thought: integer64 is a valid type for colClasses= currently and should work. But we must be missing a test for using integer64 in colClasses= because integer64 isn't defined in check_colClasses_validity().

@mattdowle
Copy link
Member

@mattdowle mattdowle commented Jan 4, 2018

I thought it would be a case of just calling something like DT[, (col):=as(col, colClasses[i])] for each column near the end of fread() at R level.
This was the R message I had in mind :

> as("2018-01-03", "foo")
Error in as("2018-01-03", "foo") : 
  no method or default for coercing “character” to “foo”

So that's what fread user would see if they used "foo" in colClasses=.
But then Date fails too which I didn't expect :

> as("2018-01-03", "Date")
Error in as("2018-01-03", "Date") : 
  no method or default for coercing “character” to “Date”

It seems you have to use as.Date like this :

> as.Date("2018-01-03")
[1] "2018-01-03"

Maybe as() is S4 not S3? Otherwise I don't see why it doesn't dispatch to the .default() method. I looked at ?as which didn't help on first glance. We could look at read.csv to see how it does it. I seem to remember it coerces at C level.

@HughParsonage
Copy link
Member Author

@HughParsonage HughParsonage commented Jan 4, 2018

I was thinking something like existsMethod or hasMethod, but I can't seem to work them out:

existsMethod("as.numeric")
#> [1] TRUE
existsMethod("as.Date")
#> [1] FALSE

Copy link
Member

@MichaelChirico MichaelChirico left a comment

awesome. minor comments:

  • vapply takes logical(1L)
  • I'm a big fan of sprintf instead of paste for error messaging

@HughParsonage
Copy link
Member Author

@HughParsonage HughParsonage commented Jan 6, 2018

Apologies for the enormous diff, but there were a few tricky paths, as evinced by the number of additional tests

  • the as.complex, as.raw, as.POSIXct, as.Date have to be implemented slightly differently (i.e. not simply through as()).
  • factor required a slight bugfixes for some corner cases
  • the interaction with select and drop was complicated
  • "NULL" elements weren't being implemented correctly.

I need to update the NEWS.md entry too.

@st-pasha
Copy link
Contributor

@st-pasha st-pasha commented Apr 27, 2018

@HughParsonage Small question: what do the marks next to issues #1426 and #1656 in this PR's description mean? Does it mean those issues remain unresolved? Is it because they are too hard to resolve, or you don't know how to resolve, or out-of-scope for this PR, or you don't know whether they should be implemented at all, or ..?

@HughParsonage
Copy link
Member Author

@HughParsonage HughParsonage commented Apr 27, 2018

Closest meaning would be 'partially resolved':

if select is used, colClasses need only correspond to the columns in select (with caveat)

Basically resolved: there are combinations of select and colClasses that don't have a clear definition. But otherwise resolved.

Issue #1656 fread/fwrite base data types directly for efficiency

Not fully resolved: base data types can now be implemented within fread, but not in C, and unlike currently supported types, they are not inferred from the content of the file -- they have to be specified.

In terms of why the latter isn't resolved, it's a combination of being too difficult for me and out-of-scope: this PR is meant to implement arbitrary colClasses if they are requested. Although treating dates as Date class would be plausible to infer from the content of the file, it would still be a breaking change and thus require more thought.

@mattdowle mattdowle removed this from the 1.12.0 milestone Jan 11, 2019
@mattdowle mattdowle added this to the 1.12.4 milestone May 1, 2019
@mattdowle mattdowle changed the title Check colClasses to be valid type. Closes #1634 Various colClasses enhancements May 1, 2019
@mattdowle mattdowle merged commit 1de0399 into Rdatatable:master May 2, 2019
4 checks passed
},
warning = fun <- function(e) {
etype = if (inherits(e,"error")) "error" else "warning"
warning(sprintf("Column '%s' was set by colClasses to be '%s' but fread encountered the following %s:\n\t%s\nso the column has been left as type '%s'",
Copy link
Member

@MichaelChirico MichaelChirico May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Construct without sprintf (I don't think I have write access to Hugh's branch...)

Copy link
Member

@mattdowle mattdowle May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind either way, but it's ok with sprintf isn't it?

Copy link
Member

@MichaelChirico MichaelChirico May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the cue from WRE Section 1.7:

Try not to split up messages into small pieces. In C error messages use a single format string containing all English words in the messages.
In R error messages do not construct a message with paste (such messages will not be translated) but via multiple arguments to stop or warning, or via gettextf.

I already purged other instances of this construction elsewhere in the code, I can't find the PR right now...

Copy link
Member

@mattdowle mattdowle May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it's merged now, follow up PRs are good. I double-checked Hugh is a project member so he can create branches too. Good catch: I'd forgotten about that.

Copy link
Member

@MichaelChirico MichaelChirico May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, working on it 👍

"complex" = as.complex(v),
"raw" = as_raw(v), # Internal implementation
"Date" = as.Date(v),
"POSIXct" = as.POSIXct(v),
Copy link
Member

@MichaelChirico MichaelChirico May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to build customized timezones into the API? Otherwise we should use tz = 'UTC' by default? Since the same string on different machines will parse differently otherwise...

Copy link
Member

@mattdowle mattdowle May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea: could add tz= to fread I guess. But shouldn't default be whatever as.POSIXct() does by default (which is local time iirc) ?

Copy link
Member

@MichaelChirico MichaelChirico May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm basically arguing against R default behavior since I've always found it a bit murky to use inferred timezones --> fread behaves differently on different machines.

Adding tz to fread feels a bit strange, maybe accept tz as a "class" within colClasses a la

colClasses = list(POSIXct = 'time', tz = 'UTC')

Or (requires a bit more parsing within our code but I think feels pretty natural):

colClasses = list(POSIXct[UTC] = 'time')

Copy link
Member

@mattdowle mattdowle May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But eventually colClasses won't be needed to get a POSIXct (it would be built in as native type). Hence tz= being argument to fread. That way, tz= would apply to all datetime columns in the file without needing to specify them in colClasses.

Copy link
Member

@MichaelChirico MichaelChirico May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I guess tz argument could have an API like colClasses to allow flexibly specifying multiple time zones:

tz = list(UTC = c('start', 'end'), 'Australia/South' = c('start_local', 'end_local'))

and

tz = c(start = 'UTC', end = 'UTC', start_local = 'Australia/South', end_local = 'Australia/South')

Copy link
Member

@jangorecki jangorecki May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it make sense to use tz="UTC" and just document that, is there any follow up issue to not forget about this?

Copy link
Member

@MichaelChirico MichaelChirico May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my only hesitation is I'm not 100% sure we can do this before implementing the tz argument... e.g. some times only exist in certain time zones, so conversion may fail if we assume UTC but could read on user's machine otherwise?

Copy link
Member

@jangorecki jangorecki May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it may fail then tryCatch, providing more complex structure to colClasses is good idea but should be added with care (ideally another PR), ideally aligning to API we will use in csvy specification. Using tz="UTC" seems to be simplest way to ensure some consistency.

\itemize{
\item{If coercion results in an error or introduces \code{NA}s, the attempt is aborted for that column with warning and the column's type is left unchanged (probably \code{character}).}
\item{Named list of vectors of column names or numbers are supported where the list names are the class names. The \code{list} form makes it easier to set a batch of columns to be a particular class; see examples. When column numbers are used in the `list` form, they refer to the column number in the file, not the column number after \code{select} or \code{drop} has been applied.}
\item{Columns are not demoted to a lower type if this would risk loss of information. You have to coerce such columns afterwards yourself, if you really require data loss.}
Copy link
Member

@MichaelChirico MichaelChirico May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Points 1 & 3 have a lot of overlap, maybe just combine?

Copy link
Member

@mattdowle mattdowle May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 is more about classes like POSIXct which take as input the character string.
3 is more about numeric data like 3.14 being specified as integer.
Maybe more examples and/or better wording?

Copy link
Member

@MichaelChirico MichaelChirico May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. At worst a tiny bit repetitive but maybe that's a good thing as it's pretty crucial.

if (!is.null(names(colClasses))) { # names are column names; convert to list approach
if (!length(colClasses)) {
colClasses=NULL;
} else if (identical(colClasses, "NULL")) {
Copy link
Member

@MichaelChirico MichaelChirico May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the use case for this? It's not mentioned in the manual...

Copy link
Member

@mattdowle mattdowle May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. Just saw it done in Hugh's branch and liked it. Maybe just to catch beginner errors? There is a warning that it is taken to mean NULL.
I can see that it breaks the case-1-not-different rule. So a 1-column file should have the column dropped (and null data.table returned) when colClasses="NULL"? For consistency. On the other hand beginners might well try "NULL".

Copy link
Member

@mattdowle mattdowle May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe user has a loop through types and all columns should be coerced to that type. for (class in c("integer","double","NULL")) fread(...., colClasses=class). But I can't imagine a good reason to do that.

Copy link
Member

@MichaelChirico MichaelChirico May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems innocuous enough, just wondering if we should document it if there's a specific use case in mind... @HughParsonage any thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment