Allow a single column to be used as rownames in as.matrix #2702

sritchie73 · 2018-03-23T03:50:41Z

Implements #2692

Added a rownames argument to as.matrix.data.table and as.data.frame.data.table with the following behaviour:

rownames = TRUE takes key(x) as the rownames of the new matrix / data.frame if it is a single column, or the first column if !haskey(x) or length(key(x)) > 1.
rownames = "column" takes the named column as the rownames of the new matrix / data.frame.
rownames = 3 looks up the column by index and uses that column as the rownames of the new matrix / data.frame.
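
The three forms above can be sketched as follows. This is a minimal illustration that assumes a data.table version that includes this PR; the table `DT` and its columns are made up for the example.

```r
library(data.table)

# Hypothetical example data: one identifier column and two numeric columns.
DT <- data.table(id = c("a", "b", "c"), x = 1:3, y = c(2.5, 3.5, 4.5))

# By column name: "id" becomes the row names and is dropped from the matrix.
m_name <- as.matrix(DT, rownames = "id")

# By column index: column 1 is "id", so this should give the same matrix.
m_index <- as.matrix(DT, rownames = 1)

# rownames = TRUE: with a single-column key, the key column is used.
setkey(DT, id)
m_key <- as.matrix(DT, rownames = TRUE)

print(m_name)
```

In each case the selected column should be removed from the matrix body, leaving only `x` and `y` as columns.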

Use cases include:

Converting a long data.table to a matrix via dcast()
Casting to a data.frame for other packages' functions that expect data.frame arguments to have rownames for subsetting / row matching in their internal workings.
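
The first use case might look like the sketch below. The table `long` and its column names are invented for illustration, and the example assumes a data.table version that includes this PR.

```r
library(data.table)

# Hypothetical long-format data: one value per (gene, sample) pair.
long <- data.table(
  gene   = c("g1", "g1", "g2", "g2"),
  sample = c("s1", "s2", "s1", "s2"),
  value  = c(1.5, 2.5, 3.5, 4.5)
)

# Cast to wide format: one row per gene, one column per sample.
wide <- dcast(long, gene ~ sample, value.var = "value")

# Use the "gene" column as row names so the matrix body is purely numeric.
m <- as.matrix(wide, rownames = "gene")
print(m)
```

Without the rownames argument, the "gene" column would be kept in the matrix and force every cell to character.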

I've added documentation pages for as.matrix.data.table and as.data.frame.data.table, describing the use and behaviour of the rownames argument and highlighting examples where it might be useful. The as.data.frame.data.table documentation also highlights the setDF function for cases where neither rownames nor a copy of the data.table is required.

I'd appreciate feedback on the amount of error checking implemented (I tend to be overzealous here), and based on that additional unit tests can be implemented.

MichaelChirico · 2018-03-23T03:55:29Z

I'd say no need for as.data.frame.data.table to be extended -- setDF is preferred and has a rownames argument already, unless I'm missing something.

There is #1719...

sritchie73 · 2018-03-23T04:06:08Z

I did see that argument in setDF. Does it allow you to use a column in dt as the rownames as suggested in #1719 ? Its documentation still suggests you must manually provide the vector yourself.

My understanding was also that as.data.frame and setDF have two different use cases:

setDF for when you want to coerce a data.table to a data.frame by reference, and
as.data.frame when you want to make a copy that is a data.frame.

Otherwise why not just alias as.data.frame.data.table to setDF? i.e.:

as.data.frame.data.table <- function(x, ...) { setDF(x) }
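
The difference between the two use cases can be sketched as below. The data is hypothetical; the only calls used are `as.data.frame()`, `copy()` and `setDF()` with its existing `rownames` argument.

```r
library(data.table)

DT <- data.table(id = c("a", "b"), x = 1:2)

# as.data.frame() returns a copy; DT itself remains a data.table.
DF_copy <- as.data.frame(DT)
still_dt <- is.data.table(DT)

# setDF() converts in place, so the object passed in stops being a data.table.
DT_in_place <- copy(DT)
setDF(DT_in_place)
converted <- !is.data.table(DT_in_place)

# setDF(copy(x)) keeps DT intact; row names are supplied manually as a vector.
DF_ref <- setDF(copy(DT), rownames = DT$id)
print(rownames(DF_ref))
```

This is why a plain alias to setDF(x) would surprise callers: it would change the class of the object they passed in.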

MichaelChirico · 2018-03-23T04:09:37Z

In fact I'd be fine with setDF(copy(x)), which is essentially what's done currently:

data.table:::as.data.frame.data.table
function (x, ...) 
{
    ans = copy(x)
    setattr(ans, "row.names", .set_row_names(nrow(x)))
    setattr(ans, "class", "data.frame")
    setattr(ans, "sorted", NULL)
    setattr(ans, ".internal.selfref", NULL)
    ans
}


# vs the is.data.table(x) branch of setDF:
if (is.null(rownames)) {
  rn <- .set_row_names(nrow(x))
} else {
  if (length(rownames) != nrow(x)) stop("rownames incorrect length; expected ", nrow(x), " names, got ", length(rownames))
  rn <- rownames
}
setattr(x, "row.names", rn)
setattr(x, "class", "data.frame")
setattr(x, "sorted", NULL)
setattr(x, ".internal.selfref", NULL)

It's more that I'm advocating spending dev time on improving setDF rather than as.data.frame.data.table.

Thanks for the PR btw!

sritchie73 · 2018-03-23T04:12:42Z

Ah I see, that makes sense.

The tricky part will be removing the column and adding it as the row.names attribute, all by reference. I don't see why it couldn't theoretically be done, but it might require some C code to implement.
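
As a rough sketch of the idea using only existing user-level functions (`set()` to drop a column and `setDF()` with its `rownames` argument), and not the implementation discussed for this PR, the steps might look like this. The `copy()` is there only so the example leaves the hypothetical `DT` untouched.

```r
library(data.table)

DT <- data.table(id = c("a", "b", "c"), x = 1:3)

work <- copy(DT)                    # copy only to keep DT intact in this example
rn <- work[["id"]]                  # grab the future row names first
set(work, j = "id", value = NULL)   # drop the column by reference
setDF(work, rownames = rn)          # convert in place and attach the row names

print(work)
```

Whether this covers every case a dedicated implementation would need to handle is left open here.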

sritchie73 · 2018-03-23T23:40:50Z

I've rolled this back so that rownames is only implemented for as.matrix.

data.frames require more thought and work. I agree the development work should go into setDF, and then as.data.frame should just call setDF. Complicating this, however, the as.data.frame generic has a row.names argument with the same functionality that rownames currently has in setDF (i.e. a character vector of rownames can be supplied). We'd need to think about how to handle this conflict in arguments.

codecov-io · 2018-03-23T23:47:25Z

Codecov Report

Merging #2702 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2702      +/-   ##
==========================================
+ Coverage    93.4%   93.42%   +0.02%     
==========================================
  Files          61       61              
  Lines       12236    12276      +40     
==========================================
+ Hits        11429    11469      +40     
  Misses        807      807

Impacted Files	Coverage Δ
R/data.table.R	`97.85% <100%> (+0.03%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1ab4baa...cde74a2. Read the comment docs.

sritchie73 · 2018-03-23T23:48:15Z

A couple of queries:

- rownames could also accept a vector to use as the rownames of the matrix. Should I add this?
- I've had to use with = FALSE to drop the column containing the rownames, but I understand with = FALSE is deprecated (Items j and with need updating in ?data.table #2620). Is there a better way of doing this?

HughParsonage · 2018-03-24T14:49:53Z

R/data.table.R

+      # E.g. because rownames is some sort of object that cant be converted to a column index
+      stop("rownames must be TRUE, a column index, or a column name in x")
+    } else {
+      if (is.logical(rownames) && isTRUE(rownames)) {


Is isTRUE(rownames) sufficient?

sritchie73 · 2018-03-25

You're right. I've changed this statement to identical(rownames, TRUE), which I think is clearer (and is used elsewhere in data.table.R).
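
For illustration, a base-R sketch of how the identical(rownames, TRUE) check separates the flag from the other accepted inputs. The helper name `is_true_flag` is made up for this example.

```r
# Hypothetical helper wrapping the check discussed above.
is_true_flag <- function(x) identical(x, TRUE)

checks <- c(
  flag    = is_true_flag(TRUE),           # the plain flag
  string  = is_true_flag("id"),           # a column name
  number  = is_true_flag(1),              # a column index
  vector  = is_true_flag(c(TRUE, TRUE)),  # longer logical vector
  missing = is_true_flag(NA)              # NA is not TRUE
)
print(checks)
```

Only the first element is TRUE, so column names, indices and NA fall through to the other branches.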

HughParsonage · 2018-03-24T14:51:34Z

R/data.table.R

+    rn <- x[[rnc]]
+    dm <- dim(x) - c(0, 1)
+    cn <- names(x)[-rnc]
+    X <- x[, -rnc, with = FALSE]


I think x[, .SD, .SDcols = c(cn)] or x[, (rn) := NULL] could work -- not 100% on what the variables are, or the best way to do this within data.table.

sritchie73 · 2018-03-25

Thanks @HughParsonage - I like the .SDcols approach. I had also thought about using the new .. syntax, but that raises the problem of R CMD check --as-cran complaining that ..rnc is a global variable with no visible binding.
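
The two ways of dropping the column that are compared in this thread can be sketched side by side. The table `x` here is hypothetical; `rnc` and `cn` follow the names used in the diff.

```r
library(data.table)

x <- data.table(id = c("a", "b"), u = 1:2, v = 3:4)
rnc <- 1L                 # index of the column holding the row names
cn <- names(x)[-rnc]      # names of the columns to keep

# Original approach: negative index with with = FALSE.
via_with <- x[, -rnc, with = FALSE]

# Suggested approach: select the remaining columns through .SDcols.
via_sdcols <- x[, .SD, .SDcols = cn]

print(all.equal(via_with, via_sdcols))
```

Both forms return a new data.table without the row-name column and leave `x` unchanged.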

mattdowle · 2018-03-27T00:07:54Z

R/data.table.R

+    rn <- x[[rnc]]
+    dm <- dim(x) - c(0, 1)
+    cn <- names(x)[-rnc]
+    X <- x[, .SD, .SDcols = cn]


x <- x[, -..rnc] now works here in dev in place of these two lines. If I understand correctly, a copy is needed at this point; otherwise x[, (rnc) := NULL] would currently be the way to remove that column by reference. Maybe x[, ..rnc := NULL] will work in future.

sritchie73 · 2018-03-27

Thanks, yes, I discovered -.. was implemented as I was working on this. Ultimately I decided not to use -..rnc, as you then have to define a dummy ..rnc at the top of the function to avoid R CMD check --as-cran complaining that ..rnc is a global variable with no visible binding.

mattdowle · 2018-03-27

Interesting, I see. That is a shame about the no visible binding note.
In general we've tried to use idiomatic data.table internally, and then deal with the no-visible-binding note explicitly by adding a dummy NULL as you say. This way, when folk look at the internals they see how we use data.table ourselves. If and when there is ever a solution for the CRAN note, we can fix it in one distinct place by taking away the NULL definitions.
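
A minimal sketch of the dummy-NULL idiom described here, using a made-up function and column names rather than anything from the data.table internals:

```r
library(data.table)

# Hypothetical function that uses column names inside [.data.table.
mean_by_group <- function(dt) {
  # Dummy bindings: they exist only so that R CMD check sees these symbols
  # defined in the function body. Inside dt[...] the columns take precedence.
  value <- group <- NULL
  dt[, list(avg = mean(value)), by = group]
}

res <- mean_by_group(data.table(group = c("a", "a", "b"), value = c(1, 3, 10)))
print(res)
```

The NULL assignments do not affect the result, so they can be removed in one place if the check note is ever resolved.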

MichaelChirico · 2018-03-27T14:52:52Z

R/data.table.R

-  dm <- dim(x)
-  cn <- names(x)
+as.matrix.data.table <- function(x, rownames, ...) {
+  rn <- NULL


Not sure we have a style guide on this, but I note that the corresponding CRAN cheats for [.data.table symbols are defined in the package environment rather than the function body:

https://github.com/Rdatatable/data.table/blob/master/R/data.table.R#L11

The plus side is cutting that tiny bit of overhead for use cases that might repeatedly call this method; the minus side is increasing the potential for unintentional collisions (so if moving outside the body, perhaps use a more obscure name).

jangorecki · 2018-03-27

They are defined there because they are exported. rn won't be used by the user.

sritchie73 · 2018-03-28

Yeap - rn is internal here; it will contain the vector of rownames to put in the matrix (after all the processing in if (!missing(rownames)) {}). rnc will contain the index of the column in x to be dropped.
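
To make the roles of rn and rnc concrete, here is a simplified sketch. The helper `matrix_with_rownames` is hypothetical and is not the code in this PR; it assumes rnc has already been resolved to a valid integer column index.

```r
library(data.table)

# Hypothetical helper: build a matrix from x, using column rnc as row names.
matrix_with_rownames <- function(x, rnc) {
  rn <- x[[rnc]]                 # vector of row names for the matrix
  cn <- names(x)[-rnc]           # names of the remaining columns
  X <- x[, .SD, .SDcols = cn]    # copy of x without the row-name column
  m <- as.matrix(X)
  rownames(m) <- rn
  m
}

m <- matrix_with_rownames(data.table(id = c("a", "b"), u = 1:2, v = 3:4), 1L)
print(m)
```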

MichaelChirico · 2018-03-27T15:03:29Z

R/data.table.R

+      stop("rownames must be a single column in x or a vector of row names of length nrow(x)")
+    } else if (is.na(rownames)) {
+      warning("rownames is NA, ignoring rownames")
+    } else if (identical(rownames, FALSE)) {


I find it a bit weird that rownames = TRUE is accepted, but rownames = FALSE is incorrect usage and results in this warning. Setting rownames dynamically by some condition evaluating to TRUE or FALSE seems like a natural use case.

sritchie73 · 2018-03-28

We can drop the warning.

sritchie73 · 2018-03-28T00:14:16Z

@MichaelChirico can you tell me how to pull your changes to my branch? They're not appearing for me locally so I'm not sure how GitHub works here.

MichaelChirico · 2018-03-28T00:25:33Z

you should be able to git pull <name for data.table upstream, usually either upstream or origin> <name of this branch>


sritchie73 · 2018-03-28T00:49:09Z

R/data.table.R

-            warning("rownames is TRUE but multiple keys found in key(x), using first column instead")
-            rownames <- 1
+          if (length(rownames) > 1L) {
+            warning(sprintf("rownames is TRUE but multiple keys [%s] found for x; defaulting to first key column [%s]",


@MichaelChirico not quite - I've defaulted to just using the first column (rather than the first key column) if there are multiple keys. We can certainly take the first key column if you think it's the right approach? (Or perhaps the last key, since that is how the rows will be ordered?)

sritchie73 · 2018-03-28T00:53:52Z

Thanks @MichaelChirico it turned out git was lying to me that my branch was up to date with origin.

I've tried to add comments to your changes but I don't quite understand the GitHub review feature, so I'll copy the comments in here:

I will drop the warning where rownames = FALSE
Your change to the warning message where multiple keys are detected is not quite correct - I've defaulted to using x[,1] as the rownames in that case rather than x[,key(x)[1]]. I can certainly change this, but maybe it would be useful to take the last key (since the rows of x will be ordered by that column)?

MichaelChirico · 2018-03-29T03:46:00Z

@sritchie73 x[ , 1] sounds fine. just a problem with my editing in-browser.
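
The agreed behaviour for a multi-column key can be sketched as below. The table is hypothetical, and the first key column is deliberately not the first column of the table. Based on this thread, a warning is expected and x[, 1] should supply the row names; the handler only records whether a warning was raised.

```r
library(data.table)

DT <- data.table(a = c("r1", "r2"), b = c(1L, 2L), c = c(10, 20))
setkey(DT, b, a)   # two key columns; the first key column is "b", not "a"

warned <- FALSE
m <- withCallingHandlers(
  as.matrix(DT, rownames = TRUE),
  warning = function(w) {
    warned <<- TRUE                   # remember that a warning was raised
    invokeRestart("muffleWarning")    # and keep going
  }
)

print(rownames(m))
print(warned)
```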

sritchie73 · 2018-04-01T06:00:45Z

@MichaelChirico I've updated the code and tests so that rownames = FALSE no longer generates a warning. I've also removed the similar warnings generated when rownames = NULL or rownames = NA.
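
A small sketch of the pass-through behaviour described here, assuming a data.table version that includes this PR. The table and the `use_first_col` condition are made up for the example.

```r
library(data.table)

DT <- data.table(x = 1:2, y = c(0.5, 1.5))
baseline <- as.matrix(DT)

# FALSE, NULL and NA should all behave as if rownames had not been supplied.
same_false <- identical(as.matrix(DT, rownames = FALSE), baseline)
same_null  <- identical(as.matrix(DT, rownames = NULL), baseline)
same_na    <- identical(as.matrix(DT, rownames = NA), baseline)

# Dynamic use: only take row names from the first column if it is character.
use_first_col <- is.character(DT[[1L]])
m <- as.matrix(DT, rownames = use_first_col)

print(c(same_false, same_null, same_na))
```

Here the first column is integer, so `use_first_col` is FALSE and `m` keeps both columns.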

MichaelChirico · 2018-04-01T09:27:29Z

LGTM

sritchie73 added 4 commits March 23, 2018 09:03

rownames argument to as.matrix and as.data.frame

0389de1

Added documentation

b1590ae

some unit tests

1323be0

added news

ddaeb6a

sritchie73 added 2 commits March 24, 2018 10:03

Resolve CRAN notes about global variable bindings

0ab7e4f

Reverted as.data.frame.data.table

8477788

sritchie73 changed the title ~~Allow a single column to be used as rownames in as.matrix and as.data.frame~~ Allow a single column to be used as rownames in as.matrix Mar 23, 2018

sritchie73 added 3 commits March 24, 2018 11:41

Unit tests for errors and warnings

ac52d9a

Fixed error message in tests and test increment numbers

b9eab65

Merging changes from upstream

c0cca0d

Merge branch 'master' into as_matrix_rownames # Conflicts: # NEWS.md # inst/tests/tests.Rraw

HughParsonage reviewed Mar 24, 2018

sritchie73 added 5 commits March 25, 2018 17:52

Removed with=FALSE

11da144

Enhanced clarity of an error check

5b4bca7

replaced isTRUE

c5ae94b

Vector of rownames may be used in as.matrix

f07b813

Fixed number in tests to reflect PR

3d4681c

mattdowle reviewed Mar 27, 2018

MichaelChirico reviewed Mar 27, 2018

MichaelChirico added 2 commits March 27, 2018 23:13

mainly cosmetic changes

d9a4a54

(hopefully no typos; on browser)

typo (rownames is integer not string here)

de48d84

sritchie73 commented Mar 28, 2018

mattdowle and others added 5 commits March 30, 2018 15:35

Merge branch 'master' into as_matrix_rownames

2810538

merged master

8348172

merge upstream PR changes

12cace8

NULL, FALSE, and NA now passthrough instead of warning

895554e

Fixed incorrect warning message

a887594

MichaelChirico approved these changes Apr 1, 2018

Removed extraneous newline before tests

cde74a2

mattdowle added this to the v1.10.6 milestone Apr 7, 2018

mattdowle merged commit e3dd285 into Rdatatable:master Apr 7, 2018

sritchie73 mentioned this pull request Apr 11, 2018

Codecov result posts immediately in PR, before test suite runs #2731

Closed

sritchie73 mentioned this pull request Jun 17, 2018

Closes #2930 -- bugfix to as.matrix.data.table() #2938

Closed
