Allow a single column to be used as rownames in as.matrix #2702
I'd say no need for There is #1719... |
I did see that argument in My understanding was also that
Otherwise why not just alias
|
In fact I'd be fine with
More that I'm advocating for dev time to improve Thanks for the PR btw! |
Ah I see, that makes sense. The tricky part will be trying to remove the column and adding it as the rownames attributes all by reference. I don't see why it couldn't theoretically be done, but might require some C code to implement. |
I've rolled this back so that
|
Codecov Report
@@ Coverage Diff @@
## master #2702 +/- ##
==========================================
+ Coverage 93.4% 93.42% +0.02%
==========================================
Files 61 61
Lines 12236 12276 +40
==========================================
+ Hits 11429 11469 +40
Misses 807 807
Continue to review full report at Codecov.
|
A couple of queries: -
|
Merge branch 'master' into as_matrix_rownames # Conflicts: # NEWS.md # inst/tests/tests.Rraw
R/data.table.R
Outdated
# E.g. because rownames is some sort of object that cant be converted to a column index
stop("rownames must be TRUE, a column index, or a column name in x")
} else {
if (is.logical(rownames) && isTRUE(rownames)) {
`isTRUE(rownames)` is sufficient?
You're right. I've changed this statement to `identical(rownames, TRUE)`, which I think is clearer (and is used elsewhere in data.table.R).
R/data.table.R
Outdated
rn <- x[[rnc]]
dm <- dim(x) - c(0, 1)
cn <- names(x)[-rnc]
X <- x[, -rnc, with = FALSE]
I think `x[, .SD, .SDcols = c(cn)]` or `x[, (rn) := NULL]` could work -- not 100% on what the variables are, or the best way to do this within data.table.
Thanks @HughParsonage - I like the `.SDcols` approach. I had also thought about using the new `..` syntax, but that raises the problem of `R CMD check --as-cran` complaining that `..rnc` is a global variable with no visible binding.
rn <- x[[rnc]]
dm <- dim(x) - c(0, 1)
cn <- names(x)[-rnc]
X <- x[, .SD, .SDcols = cn]
`x <- x[, -..rnc]` now works here in dev instead of these 2 lines. A copy is needed in this case at this point, iiuc; otherwise, `x[, (rnc) := NULL]` to remove that column by reference, currently. And maybe `x[, ..rnc := NULL]` in future.
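For readers comparing the column-dropping options raised in this review, here is a minimal sketch. The table `x`, the index `rnc`, and the result names `X1`, `X2`, and `y` are made up for illustration and are not the PR's code. It assumes a data.table version where `with = FALSE`, `.SDcols`, and `:=` with a column index behave as described in the comments above.

```r
library(data.table)

# Hypothetical example table; `rnc` is the index of the column to drop.
x <- data.table(id = c("a", "b"), v1 = 1:2, v2 = 3:4)
rnc <- 1L
cn <- names(x)[-rnc]

# Option 1: negative index with with = FALSE (returns a new table).
X1 <- x[, -rnc, with = FALSE]

# Option 2: select the remaining columns through .SDcols (also a new table).
X2 <- x[, .SD, .SDcols = cn]

# Option 3: remove the column by reference; done on a copy here so `x` is untouched.
y <- copy(x)
y[, (rnc) := NULL]
```

The first two options leave `x` unchanged, while `:=` modifies its target in place, which is why the sketch copies before using it.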
Thanks, yes I discovered `-..` was implemented as I was working on this. Ultimately I decided not to use `-..rnc`, as you then have to define a dummy `..rnc` at the top of the function to avoid `R CMD check --as-cran` complaining that `..rnc` is a global variable with no visible binding.
Interesting, I see. That is a shame about the no visible binding note.
In general we've tried to use idiomatic data.table internally, and then deal with the no-visible-binding note explicitly by adding a dummy NULL as you say. This way, when folk look at the internals they see how we use data.table ourselves. If and when there is ever a solution for the CRAN note, we can fix it in one distinct place by taking away the NULL definitions.
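As a rough illustration of the dummy-NULL pattern described above: the function `total_value` and the column `value` are hypothetical and not taken from data.table's internals. The local `value <- NULL` exists only so that static code checks see a binding; inside `[.data.table` the column of the same name is found first.

```r
library(data.table)

total_value <- function(dt) {
  # Dummy binding: silences the "no visible binding for global variable" note.
  # It is never used, because `value` resolves to the column inside dt[...].
  value <- NULL
  dt[, sum(value)]
}

dt <- data.table(value = c(1, 2, 3))
res <- total_value(dt)
```

Keeping the dummy definitions together at the top of a function makes them easy to delete later if the check note ever gets a proper solution.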
dm <- dim(x)
cn <- names(x)
as.matrix.data.table <- function(x, rownames, ...) {
rn <- NULL
Not sure we have a style guide on this, but I note that the corresponding CRAN cheat symbols for `[.data.table` are defined in the package environment rather than the function body:
https://github.com/Rdatatable/data.table/blob/master/R/data.table.R#L11
+ side is cutting that tiny bit of overhead for use cases that might repeatedly call this method; - side is increasing potential for unintentional collisions (so if moving outside the body, perhaps use some more obscure name)
They are defined there because they are exported. `rn` won't be used by the user.
Yeap - `rn` is internal here; it will contain the vector of rownames to put in the matrix (after all the processing in `if (!missing(rownames)) {}`). `rnc` will contain the index of the column in `x` to be dropped.
R/data.table.R
Outdated
stop("rownames must be a single column in x or a vector of row names of length nrow(x)")
} else if (is.na(rownames)) {
warning("rownames is NA, ignoring rownames")
} else if (identical(rownames, FALSE)) {
I find it a bit weird that `rownames = TRUE` is accepted, but `rownames = FALSE` is incorrect usage and results in this `warning`. Setting `rownames` dynamically by some condition evaluating to `TRUE` or `FALSE` seems like a natural use case.
We can drop the `warning` (hopefully no typos; on browser).
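To make the dynamic use case concrete, here is a small sketch. It assumes a data.table version in which `as.matrix()` accepts `rownames = FALSE` silently (treating it like no rownames), which is the direction this thread takes. The wrapper `to_matrix` and the example table are hypothetical.

```r
library(data.table)

dt <- data.table(id = c("a", "b"), x = 1:2)

# Hypothetical wrapper: the caller decides at run time whether rownames are wanted.
to_matrix <- function(dt, use_rownames) {
  as.matrix(dt, rownames = use_rownames)
}

# TRUE with no key: the first column becomes the rownames and is dropped.
m_named <- to_matrix(dt, TRUE)

# FALSE: every column is kept and the matrix has no rownames.
m_plain <- to_matrix(dt, FALSE)
```

With a warning on `FALSE`, a wrapper like this would need an extra branch just to avoid passing the argument.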
@MichaelChirico can you tell me how to pull your changes to my branch? They're not appearing for me locally so I'm not sure how GitHub works here. |
You should be able to `git pull <name for data.table upstream, usually either upstream or origin> <name of this branch>`.
|
R/data.table.R
Outdated
warning("rownames is TRUE but multiple keys found in key(x), using first column instead")
rownames <- 1
if (length(rownames) > 1L) {
warning(sprintf("rownames is TRUE but multiple keys [%s] found for x; defaulting to first key column [%s]",
@MichaelChirico not quite - I've defaulted to just using the first column (rather than the first key column) if there are multiple keys. We can certainly take the first key column if you think it's the right approach? (Or perhaps the last key, since that is how the rows will be ordered?)
Thanks @MichaelChirico. It turns out I tried to add comments to your changes, but I don't quite understand the GitHub review feature, so I'll copy the comments in here:
|
@sritchie73 |
@MichaelChirico i've updated the code and tests so that |
LGTM |
Implements #2692
Added a `rownames` argument to `as.matrix.data.table` and `as.data.frame.data.table` with the following behaviour:
- `rownames = TRUE` takes `key(x)` as the rownames of the new matrix / data.frame if it is a single column, or the first column if `!haskey(x)` or `length(key(x)) > 1`.
- `rownames = "column"` takes the named column as the rownames of the new matrix / data.frame.
- `rownames = 3` looks up the column by index and uses that column as the rownames of the new matrix / data.frame.

Use cases include:
- `data.table` to a matrix via `dcast()`
- `data.frame` for other packages' functions that expect data.frame arguments to have rownames for subsetting / row matching in their internal workings.

I've added documentation pages for `as.matrix.data.table` and `as.data.frame.data.table`, describing the use and behaviour of the `rownames` argument and highlighting examples where it might be useful. The `as.data.frame.data.table` documentation also highlights the `setDF` function for cases where neither rownames nor a copy of the `data.table` is required.

I'd appreciate feedback on the amount of error checking implemented (I tend to be overzealous here); based on that, additional unit tests can be implemented.
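A brief sketch of the three forms of the `rownames` argument described above, using `as.matrix()` only. It assumes a data.table version that includes this change; the example table and the names `m_by_name`, `m_by_index`, and `m_by_key` are invented for illustration.

```r
library(data.table)

dt <- data.table(id = c("a", "b", "c"), x = 1:3, y = c(10, 20, 30))

# By column name: "id" becomes the rownames and is dropped from the matrix.
m_by_name <- as.matrix(dt, rownames = "id")

# By column index: same result, looking the column up by position.
m_by_index <- as.matrix(dt, rownames = 1)

# TRUE: uses the key when it is a single column.
setkey(dt, id)
m_by_key <- as.matrix(dt, rownames = TRUE)
```

In each case the remaining columns `x` and `y` form a numeric matrix, so rows can then be matched by name, for example `m_by_name["b", "y"]`.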