Various colClasses enhancements #2545

HughParsonage · 2018-01-03T10:50:01Z

Closes #491 fread: colClasses does not convert to non-builtin types
Closes #2610 colClasses = POSIXct
Closes #1634 fread doesn't check colClasses to be valid type
Closes #2025 allow stringsAsFactors parameter to be a fraction between 0 and 1
(no issue number) When colClasses includes multiple list elements named factor (e.g. list(factor = 1, factor = 2)), all elements are respected, not just the first.
#1445 has been fixed when select is non-monotonic.
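
The point about repeated list names can be illustrated with a small base-R sketch. This is not the code in this PR; `merge_by_class` is a hypothetical helper that only shows one way every element named `factor` could be honoured rather than just the first.

```r
# Hypothetical helper (not part of data.table): gather every element of a
# colClasses-style list under its class name, so repeated names are all kept.
merge_by_class <- function(colClasses) {
  cols <- unlist(colClasses, use.names = FALSE)
  classes <- rep(names(colClasses), lengths(colClasses))
  split(cols, classes)
}

cc <- list(factor = 1, factor = 2, integer = 3)
merged <- merge_by_class(cc)
merged$factor   # holds both column 1 and column 2
```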

Not tackled here
#1656 fread/fwrite *base data types* directly for efficiency
#1426 if select is used, colClasses need only correspond to the columns in select; done in #3547

codecov-io · 2018-01-03T11:09:27Z

Codecov Report

Merging #2545 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2545      +/-   ##
==========================================
+ Coverage   97.01%   97.03%   +0.02%     
==========================================
  Files          66       66              
  Lines       12484    12559      +75     
==========================================
+ Hits        12111    12187      +76     
+ Misses        373      372       -1

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/freadR.c | `96.54% <100%> (+0.25%)` | ⬆️ |
| R/fread.R | `99.4% <100%> (+0.84%)` | ⬆️ |
| src/init.c | `100% <100%> (ø)` | ⬆️ |
| R/data.table.R | `97.65% <100%> (ø)` | ⬆️ |
| src/fread.c | `98.5% <100%> (ø)` | ⬆️ |
| src/rbindlist.c | `100% <0%> (ø)` | ⬆️ |
| R/frank.R | `100% <0%> (ø)` | ⬆️ |
| R/setops.R | `98.96% <0%> (ø)` | ⬆️ |

... and 5 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 78acb70...8ef1402.

mattdowle · 2018-01-03T20:44:40Z

Hi @HughParsonage. I've invited you to be a project member; you'll need to accept the invite before the change will apply. This enables you to create branches in the main repository directly so we can all push to each other's branches. Welcome!

mattdowle · 2018-01-03T20:36:05Z

inst/tests/tests.Rraw

-    test(167, names(print(ggplot(DT,aes(b,f))+geom_point()))[c(1,3)], c("data","plot"))
+    DF <- data.frame( a=1:5, b=11:50, d=c("A","B","C","D"), f=1:5, grp=1:5 )
+    res167 <- names(print(ggplot(DF,aes(b,f))+geom_point()))[c(1,3)]
+    test(167, names(print(ggplot(DT,aes(b,f))+geom_point()))[c(1,3)], res167)
    # The names() is a stronger test that it has actually plotted, but also because test() sees the invisible result


What's the reason for this change please? The test no longer seems to test anything and repeats the test expression.

Is it possible you hit this unrelated issue with the same test 167 : #2546

I understood the test to mean that using data.table as opposed to data.frame does not affect ggplots. Note res167 uses data.frame.

On my computer: with ggplot2 v 2.2.1.9000,

names(print(ggplot(DT,aes(b,f))+geom_point()))[c(1,3)] # [1] "data" "scales"

not c("data", "plot"). Indeed, none of the names are "plot". Would it be better to use all the names?

Oh I see now. I didn't spot the difference DF vs DT in those two lines. ggplot2 changed its names then, and if you upgrade to the CRAN version ggplot2 v2.2.1 (released a year ago) it should work. Or is 2.2.1.9000 a more recent development version of ggplot2? If so, has it changed again?

Yep, ggplot2 2.2.1.9000 is latest dev version you have. Good test. Ok they changed again and your solution looks good to me. Maybe just a comment alongside pointing out that DF is being used to find out what the names are.

Yes, 2.2.1.9000 is the current dev version. Looking at the code for ggplot2, I can't tell what the expected behaviour ought to be:

ggplot.data.frame returns the list that I see: https://github.com/tidyverse/ggplot2/blob/7b5c185eff0dabef50e838701ab805ae09a1170f/R/plot.r#L92

ggplot_build.ggplot returns the list that the test originally expected. https://github.com/tidyverse/ggplot2/blob/152910a698121ac4fa771d110507e607d45cfc8f/R/plot-build.r#L99

Note that the test appears to proxy the actual appearance of the plot, and the plot does indeed occur on my machine.

mattdowle · 2018-01-04T00:11:32Z

inst/tests/tests.Rraw

 test(1445, fread("doublequote_newline.csv")[7:10], data.table(A=c(1L,1L,2L,1L), B=c("a","embedded \"\"field\"\"\nwith some embedded new\nlines as well","not this one","a")))
+}


This is passing successfully on AppVeyor Windows tests and on CRAN Windows tests. Something else must be going wrong on your Windows machine, to fix separately? Important to keep the test running on Windows.

Would you prefer I raise this as an issue and revert this change, or relax the test to permit e.g. `embedded ""field""\r\nwith some embedded new\r\nlines as well` in addition to `embedded ""field""\nwith some embedded new\nlines as well`?

Yes: new issue please and revert the change. Why does it pass on AppVeyor and CRAN Windows but not on your Windows?

Ahh, I think it's because when I clone the repo, \n is replaced with \r\n by git. My bad.

    fread("https://raw.githubusercontent.com/Rdatatable/data.table/master/inst/tests/doublequote_newline.csv")[["B"]][8]
    #> [1] "embedded \"\"field\"\"\nwith some embedded new\nlines as well"
    fread("inst/tests/doublequote_newline.csv")[["B"]][8]
    #> [1] "embedded \"\"field\"\"\r\nwith some embedded new\r\nlines as well"

Test failed locally because the test file changed `\n` to `\r\n` when cloned.


mattdowle

(Thinking out loud...)
The end result is great but going too far I feel. check_colClasses_validity() has quite a lot of lines to maintain relative to what it does. When I filed the issue, I had in mind just a quick 2 or 3 extra lines in fread! In freadR.c on line 32 are the R names that are matched to colClasses supplied by user. When there's no match, the idea was at some point to read data as character and then call the class method afterwards at R level, like read.csv does. If there is no class method (e.g. no as.foo exists at R level, not even defined by user), that's when it should fail, for that reason (standard R error that as.foo() does not exist). There's 'CLASS' at C level in freadR.c with that in mind but I don't recall how far we went implementing that. Listing the valid and invalid types explicitly in check_colClasses_validity isn't in the spirit of S3 dispatch. Maybe all we need is the class method dispatched to call, say, as.Date in the meantime until we implement directly at C level. And when we do implement directly at C level, there will be no change for the user other than experiencing a speedup.
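
As a rough illustration of the idea above (read as character, then call the class's coercion function at R level and let R raise its own error when none exists), here is a minimal sketch. `coerce_via_as_method` is a hypothetical name, not a data.table function, and this is not how fread is implemented.

```r
# Sketch only: look up an R-level coercion function named as.<class>.
# coerce_via_as_method is a hypothetical name, not a data.table function.
coerce_via_as_method <- function(x, class) {
  # match.fun() raises a standard R error when no such function exists,
  # e.g. for class = "foo" unless the user has defined as.foo().
  f <- match.fun(paste0("as.", class))
  f(x)
}

d <- coerce_via_as_method("2018-01-03", "Date")
class(d)

# An unknown class surfaces R's own error rather than a hand-written list check.
bad <- tryCatch(coerce_via_as_method("2018-01-03", "foo"),
                error = function(e) conditionMessage(e))
```

With this shape there is no explicit list of valid types to maintain: any class with an `as.<class>` function visible at R level, including user-defined ones, would be accepted.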

HughParsonage · 2018-01-04T01:49:05Z

Do you want me to attempt a simpler version using the same approach (most of the complexity is just to construct the phrasing of the warning message), like

    if (!all(colClasses %chin% c(NA, "logical", "integer", "numeric", "double", "character", "factor"))) {
      warning("colClasses contains unsupported values.")
    }

(with corresponding tests modified).

or do you think listing all the valid classes in fread.R is simply the wrong approach at this time? Obviously such an approach would still not be in the spirit of S3 dispatch, but this PR is merely intended as a stop-gap for the issue while colClasses = "Date" etc are not yet implemented.

mattdowle · 2018-01-04T01:54:08Z

My current thought is that the dispatch to the class method should be implemented rather than this stop-gap. I'm now thinking that would be quicker. I didn't realize that before.
To confirm the new thought: integer64 is a valid type for colClasses= currently and should work. But we must be missing a test for using integer64 in colClasses= because integer64 isn't defined in check_colClasses_validity().

mattdowle · 2018-01-04T02:28:02Z

I thought it would be a case of just calling something like DT[, (col):=as(col, colClasses[i])] for each column near the end of fread() at R level.
This was the R message I had in mind :

> as("2018-01-03", "foo")
Error in as("2018-01-03", "foo") : 
  no method or default for coercing “character” to “foo”

So that's what fread user would see if they used "foo" in colClasses=.
But then Date fails too which I didn't expect :

> as("2018-01-03", "Date")
Error in as("2018-01-03", "Date") : 
  no method or default for coercing “character” to “Date”

It seems you have to use as.Date like this :

> as.Date("2018-01-03")
[1] "2018-01-03"

Maybe as() is S4 not S3? Otherwise I don't see why it doesn't dispatch to the .default() method. I looked at ?as which didn't help on first glance. We could look at read.csv to see how it does it. I seem to remember it coerces at C level.

HughParsonage · 2018-01-04T02:34:40Z

I was thinking something like existsMethod or hasMethod, but I can't seem to work them out:

existsMethod("as.numeric")
#> [1] TRUE
existsMethod("as.Date")
#> [1] FALSE
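
One possible explanation is that `existsMethod` and `hasMethod` query S4 method tables, whereas `as.Date` is an ordinary S3 generic. A simpler check, sketched below under that assumption, is to ask whether a function called `as.<class>` is visible at all. `has_as_function` is a hypothetical helper name.

```r
# Sketch: test for an R-level function named as.<class>, S3 or otherwise.
# has_as_function is a hypothetical helper, not part of data.table.
has_as_function <- function(class) {
  exists(paste0("as.", class), mode = "function")
}

has_as_function("numeric")
has_as_function("Date")
has_as_function("foo")   # FALSE unless the user has defined as.foo()
```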

MichaelChirico

awesome. minor comments:

vapply takes logical(1L)
I'm a big fan of sprintf instead of paste for error messaging

HughParsonage · 2018-01-06T04:58:12Z

Apologies for the enormous diff, but there were a few tricky paths, as evinced by the number of additional tests

the as.complex, as.raw, as.POSIXct, as.Date have to be implemented slightly differently (i.e. not simply through as()).
factor required slight bugfixes for some corner cases
the interaction with select and drop was complicated
"NULL" elements weren't being implemented correctly.

I need to update the NEWS.md entry too.


st-pasha · 2018-04-27T01:01:23Z

@HughParsonage Small question: what do the ❓marks next to issues #1426 and #1656 in this PR's description mean? Does it mean those issues remain unresolved? Is it because they are too hard to resolve, or you don't know how to resolve, or out-of-scope for this PR, or you don't know whether they should be implemented at all, or ..?

HughParsonage · 2018-04-27T01:07:27Z

Closest meaning would be 'partially resolved':

if select is used, colClasses need only correspond to the columns in select (with caveat)

Basically resolved: there are combinations of select and colClasses that don't have a clear definition. But otherwise resolved.

Issue #1656 fread/fwrite base data types directly for efficiency

Not fully resolved: base data types can now be implemented within fread, but not in C, and unlike currently supported types, they are not inferred from the content of the file -- they have to be specified.

In terms of why the latter isn't resolved, it's a combination of being too difficult for me and out-of-scope: this PR is meant to implement arbitrary colClasses if they are requested. Although treating dates as Date class would be plausible to infer from the content of the file, it would still be a breaking change and thus require more thought.


MichaelChirico · 2019-05-02T02:26:18Z

R/fread.R

+      },
+      warning = fun <- function(e) {
+        etype = if (inherits(e,"error")) "error" else "warning"
+        warning(sprintf("Column '%s' was set by colClasses to be '%s' but fread encountered the following %s:\n\t%s\nso the column has been left as type '%s'",


Construct without sprintf (I don't think I have write access to Hugh's branch...)

I don't mind either way, but it's ok with sprintf isn't it?

I took the cue from WRE Section 1.7:

Try not to split up messages into small pieces. In C error messages use a single format string containing all English words in the messages.
In R error messages do not construct a message with paste (such messages will not be translated) but via multiple arguments to stop or warning, or via gettextf.

I already purged other instances of this construction elsewhere in the code, I can't find the PR right now...
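
For reference, the WRE advice quoted above could look roughly like the sketch below: the whole message is a single format string passed to `gettextf()`, so it can be translated as one unit. The function name and its parameters are hypothetical and only stand in for the values used in `R/fread.R`.

```r
# Sketch: build the whole message from one format string with gettextf(),
# rather than gluing fragments together with paste().
# warn_coercion_failed and its arguments are hypothetical names.
warn_coercion_failed <- function(col, cls, etype, msg, type) {
  warning(gettextf(
    "Column '%s' was set by colClasses to be '%s' but fread encountered the following %s:\n\t%s\nso the column has been left as type '%s'",
    col, cls, etype, msg, type), call. = FALSE)
}

# Capture the warning text to inspect it.
m <- tryCatch(
  warn_coercion_failed("d", "Date", "error", "bad format", "character"),
  warning = function(w) conditionMessage(w))
```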

As it's merged now, follow up PRs are good. I double-checked Hugh is a project member so he can create branches too. Good catch: I'd forgotten about that.

yep, working on it 👍

MichaelChirico · 2019-05-02T02:28:50Z

R/fread.R

+             "complex" = as.complex(v),
+             "raw" = as_raw(v),  # Internal implementation
+             "Date" = as.Date(v),
+             "POSIXct" = as.POSIXct(v),


Is there any way to build customized timezones into the API? Otherwise we should use tz = 'UTC' by default? Since the same string on different machines will parse differently otherwise...

Good idea: could add tz= to fread I guess. But shouldn't default be whatever as.POSIXct() does by default (which is local time iirc) ?

I'm basically arguing against R default behavior since I've always found it a bit murky to use inferred timezones --> fread behaves differently on different machines.

Adding tz to fread feels a bit strange, maybe accept tz as a "class" within colClasses a la

colClasses = list(POSIXct = 'time', tz = 'UTC')

Or (requires a bit more parsing within our code but I think feels pretty natural):

colClasses = list(POSIXct[UTC] = 'time')

But eventually colClasses won't be needed to get a POSIXct (it would be built in as native type). Hence tz= being argument to fread. That way, tz= would apply to all datetime columns in the file without needing to specify them in colClasses.

And I guess tz argument could have an API like colClasses to allow flexibly specifying multiple time zones:

tz = list(UTC = c('start', 'end'), 'Australia/South' = c('start_local', 'end_local'))

and

tz = c(start = 'UTC', end = 'UTC', start_local = 'Australia/South', end_local = 'Australia/South')
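
The two proposed shapes carry the same information, so one can be normalised into the other. The sketch below converts the list form into the named-vector form in base R; neither form is an existing fread argument, this only illustrates the proposal.

```r
# Sketch: normalise the proposed list form of tz= into the named-vector form.
# Neither form is an existing fread API; both are proposals from this thread.
tz_list <- list(UTC = c("start", "end"),
                "Australia/South" = c("start_local", "end_local"))

tz_vec <- setNames(rep(names(tz_list), lengths(tz_list)),
                   unlist(tz_list, use.names = FALSE))
tz_vec
```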

I think it makes sense to use tz="UTC" and just document that. Is there a follow-up issue so we don't forget about this?

my only hesitation is I'm not 100% sure we can do this before implementing the tz argument... e.g. some times only exist in certain time zones, so conversion may fail if we assume UTC but could read on user's machine otherwise?

If it may fail, then use tryCatch. Providing a more complex structure to colClasses is a good idea but should be added with care (ideally in another PR), ideally aligning with the API we will use in the csvy specification. Using tz="UTC" seems to be the simplest way to ensure some consistency.
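
A minimal sketch of that suggestion, assuming a fixed `tz = "UTC"` and a `tryCatch` fallback that returns the input unchanged when parsing fails. `parse_utc` is a hypothetical helper, not the code merged in this PR.

```r
# Sketch: parse with an explicit time zone so the same string yields the same
# instant on every machine; fall back to the original value on error.
# parse_utc is a hypothetical helper name.
parse_utc <- function(v) {
  tryCatch(as.POSIXct(v, tz = "UTC"), error = function(e) v)
}

x <- parse_utc("1970-01-02 00:00:00")
as.numeric(x)   # seconds since the epoch, independent of the local time zone

y <- parse_utc("not a time")   # parsing fails, so the character value is kept
```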

MichaelChirico · 2019-05-02T02:33:17Z

man/fread.Rd

+  \itemize{
+    \item{If coercion results in an error or introduces \code{NA}s, the attempt is aborted for that column with warning and the column's type is left unchanged (probably \code{character}).}
+    \item{Named list of vectors of column names or numbers are supported where the list names are the class names. The \code{list} form makes it easier to set a batch of columns to be a particular class; see examples. When column numbers are used in the `list` form, they refer to the column number in the file, not the column number after \code{select} or \code{drop} has been applied.}
+    \item{Columns are not demoted to a lower type if this would risk loss of information. You have to coerce such columns afterwards yourself, if you really require data loss.}


Points 1 & 3 have a lot of overlap, maybe just combine?

1 is more about classes like POSIXct which take as input the character string.
3 is more about numeric data like 3.14 being specified as integer.
Maybe more examples and/or better wording?

Sounds good. At worst a tiny bit repetitive but maybe that's a good thing as it's pretty crucial.
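
The first documented rule (abort the coercion for a column if it errors or introduces `NA`s, and leave the column unchanged with a warning) can be sketched as below. `try_coerce` is a hypothetical name and the warning text is illustrative; the actual logic in `R/fread.R` may differ.

```r
# Sketch of the documented rule: keep the original column when coercion
# errors or introduces new NAs. try_coerce is a hypothetical helper.
try_coerce <- function(v, coerce) {
  out <- tryCatch(suppressWarnings(coerce(v)), error = function(e) NULL)
  if (is.null(out) || sum(is.na(out)) > sum(is.na(v))) {
    warning("coercion failed or introduced NAs; column left as type '",
            class(v)[1L], "'", call. = FALSE)
    return(v)
  }
  out
}

ok <- try_coerce(c("1", "2"), as.integer)                      # coerced
kept <- suppressWarnings(try_coerce(c("1", "x"), as.integer))  # "x" would become NA
bad_date <- suppressWarnings(try_coerce("not a date", as.Date)) # as.Date() errors
```

Note that this check does not cover the third rule: `as.integer("3.14")` truncates without producing an `NA`, so lossy demotion needs a separate test.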

MichaelChirico · 2019-05-02T02:35:26Z

R/fread.R

-    if (!is.null(names(colClasses))) {   # names are column names; convert to list approach
+    if (!length(colClasses)) {
+      colClasses=NULL;
+    } else if (identical(colClasses, "NULL")) {


what's the use case for this? It's not mentioned in the manual...

Not sure. Just saw it done in Hugh's branch and liked it. Maybe just to catch beginner errors? There is a warning that it is taken to mean NULL.
I can see that it breaks the case-1-not-different rule. So a 1-column file should have the column dropped (and null data.table returned) when colClasses="NULL"? For consistency. On the other hand beginners might well try "NULL".

Maybe user has a loop through types and all columns should be coerced to that type. for (class in c("integer","double","NULL")) fread(...., colClasses=class). But I can't imagine a good reason to do that.

Seems innocuous enough, just wondering if we should document it if there's a specific use case in mind... @HughParsonage any thoughts?

HughParsonage added 7 commits December 24, 2017 11:31

Merge remote-tracking branch 'refs/remotes/Rdatatable/master'

99cef9d

Closes Rdatatable#2198

08e7cc6

Merge remote-tracking branch 'upstream/master'

8fff141

Add tests for Rdatatable#1634

c9a2312

Fix tests 167 (ggplot2) which otherwise fail

30c6b86

Fix platform-dependent tests (\r\n vs \n)

aedd930

Check colClasses to be valid type. Closes Rdatatable#1634

b0a7114

mattdowle reviewed Jan 3, 2018

mattdowle reviewed Jan 4, 2018

HughParsonage added 3 commits January 4, 2018 11:50

Revert omission of test on Windows

6f9bc24

Test failed locally because the test file changed `\n` to `\r\n` when cloned.

(Forgot this closing brace)

4f7e2d1

Re: Matt's review. Explain rationale of res167.

cd497b6

https://github.com/Rdatatable/data.table/pull/2545/files#r159559822

mattdowle requested changes Jan 4, 2018

Close brace at end of test()

f50ac10

mattdowle mentioned this pull request Jan 4, 2018

fread doesn't check colClasses to be valid type #1634

Closed

HughParsonage added 5 commits January 5, 2018 18:06

Support colClasses, but as.raw problematic

d543afc

Select/drop order respected with colClasses

d7c9937

Incorporate setfactor within set_colClasses

ed66a74

Fix NULL colClasses

3af4ca1

Slightly better commentary

61b1a07

MichaelChirico reviewed Jan 6, 2018

HughParsonage added 2 commits January 6, 2018 15:59

vapply should take logical(1L)

de9da35

Add coverage, fix typos in fread.R

d3f351c

(Also fix spurious movement of })

Merge branch 'master' into master

6ee00a0

jangorecki added this to the 1.12.0 milestone Jun 26, 2018

HughParsonage mentioned this pull request Jul 28, 2018

Closes #2986; fread:select was sorting integers incorrectly #2987

Merged

HughParsonage and others added 4 commits August 19, 2018 19:18

Merge branch 'master' into master

b7b647f

Fix wrong inclusion

2d09126

Merge branch 'master' into master

9fc22d9

Merge branch 'master' into master

177215c

mattdowle removed this from the 1.12.0 milestone Jan 11, 2019

mattdowle added 3 commits April 29, 2019 15:44

Merge branch 'master' into master

d1b0a04

colClassesAs (which need as_ afterwards at R level, if any) returned …

6530fcb

…from C level to simplify R level. Not yet passing all tests; wip.

passes tests locally

c6e2b72

mattdowle added this to the 1.12.4 milestone May 1, 2019

coverage

9a54c6b

mattdowle changed the title ~~Check colClasses to be valid type. Closes #1634~~ Various colClasses enhancements May 1, 2019

mattdowle added 3 commits May 1, 2019 16:27

news item tidied and moved up

3fe8be6

tidy

a3ae6e7

link added to news item

8ef1402

mattdowle merged commit 1de0399 into Rdatatable:master May 2, 2019

MichaelChirico reviewed May 2, 2019

MichaelChirico mentioned this pull request May 2, 2019

fread should read columns of Date/POSIXct types directly #1450

Closed

MichaelChirico pushed a commit that referenced this pull request May 2, 2019

follow-up to #2545 -- replace warning(sprintf

e312722

MichaelChirico mentioned this pull request May 2, 2019

follow-up to #2545 -- replace warning(sprintf #3531

Merged

mattdowle pushed a commit that referenced this pull request May 2, 2019

follow-up to #2545 -- replace warning(sprintf (#3531)

8939988

MichaelChirico mentioned this pull request May 4, 2019

CSVY wishlist #3540

Open

7 tasks

mattdowle mentioned this pull request May 7, 2019

colClasses correspond to select #3547

Merged

4 tasks
