Return a data.table without the key #981

Closed
geneorama opened this issue Dec 8, 2014 · 10 comments

Comments

@geneorama

It would be nice to be able to return a data.table on the fly without the key.

This would be useful for things like regressions where you might want to keep a key, but you don't want to include it in the regression. Of course I could use my own function, but I would prefer to use something standard.

Example function

Perhaps there's something more elegant / obvious?

keyless <- function(x){
    x[ , -which(colnames(x) %in% key(x)), with=FALSE]
}

Example usage:

library(data.table)
## Example using the rock data, with an additional column ID which 
## in a real example may be used to join different data sets.
dt <- data.table(id=paste0("rock", sprintf("%02d", 1:48)), rock)
setkey(dt, id)

## View the structure:
str(dt)

# Classes ‘data.table’ and 'data.frame':  48 obs. of  5 variables:
#  $ id   : chr  "rock01" "rock02" "rock03" "rock04" ...
#  $ area : int  4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
#  $ peri : num  2792 3893 3931 3869 3949 ...
#  $ shape: num  0.0903 0.1486 0.1833 0.1171 0.1224 ...
#  $ perm : num  6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...
#  - attr(*, ".internal.selfref")=<externalptr> 
#  - attr(*, "sorted")= chr "id"

## Define "keyless"
keyless <- function(x){
    x[ , -which(colnames(x) %in% key(x)), with=FALSE]
}
## Do a regression
## Obviously we want to exclude the column of identifiers, so we use keyless
lm(area~., keyless(dt))

# Call:
# lm(formula = area ~ ., data = keyless(dt))
# 
# Coefficients:
# (Intercept)         peri        shape         perm  
#    -407.069        2.193     2992.314        2.549  

I know that others have mentioned this, but I couldn't find an existing issue.

Thank you

@yitang

yitang commented Dec 8, 2014

.SD and .SDcol will do the job.

R> head(dt)
                   ymd  london  pairs berlin
1: 1900-01-01 12:00:00 0.62158 0.8151 0.2893
2: 1900-01-02 12:00:00 0.09772 0.7228 0.5576
3: 1900-01-03 12:00:00 0.65804 0.8039 0.9895
4: 1900-01-04 12:00:00 0.87387 0.2731 0.1960
5: 1900-01-05 12:00:00 0.75414 0.4138 0.4678
6: 1900-01-06 12:00:00 0.60392 0.6056 0.2084

R> lm(london ~ ., dt[, .SD, .SDcol = -key(dt)])

Call:
lm(formula = london ~ ., data = dt[, .SD, .SDcol = -key(dt)])

Coefficients:
(Intercept)        pairs       berlin  
    0.50158      0.00231     -0.00230  

enjoy native data.table :)
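A minimal sketch of a variant of the same idea, spelling out the full argument name `.SDcols` and selecting the non-key columns by name with `setdiff()`. It reuses the `rock` example from the first comment; the variable names `value_cols`, `no_key`, and `plain` are illustrative only. One assumed benefit is that a table without a key is handled too, since `key()` returns `NULL` there and `setdiff()` then keeps every column.

```r
library(data.table)

# Keyed table from the original example (rock ships with base R's datasets).
dt <- data.table(id = paste0("rock", sprintf("%02d", 1:48)), rock)
setkey(dt, id)

# Columns that are not part of the key.
value_cols <- setdiff(names(dt), key(dt))
no_key <- dt[, .SD, .SDcols = value_cols]

names(no_key)   # "area" "peri" "shape" "perm"

# The same expression keeps every column when the table has no key.
plain <- data.table(a = 1:2, b = 3:4)
names(plain[, .SD, .SDcols = setdiff(names(plain), key(plain))])   # "a" "b"
```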

@geneorama
Author

yi-tang

I just saw your response, and thank you! However, I still think it would be useful (simpler and easier to read) to have a function that returns a data.table without the key, similar to the coredata function in the zoo package.

You make a great argument to simply rely on the native functionality, and this is probably a question of design. I think the coredata (or whatever) function would be nice, but I can see the other side here too.

BUT, I would much prefer dt[, .SD, .SDcol = -key(dt)] over dt[ , -which(colnames(dt) %in% key(dt)), with=FALSE], so thanks for that! I'll definitely use that over the original (but I would still personally prefer coredata(dt) or even keyless(dt)).

-Gene

@arunsrinivasan
Member

Gene, I've marked as FR, but at the moment, I don't see a reason "for". It seems reasonable to me to write your own function, as it's a very special case of a subset operation. Are there other compelling cases where you need this?

@jangorecki
Member

@geneorama
What you suggest is a simple wrapper:

keyless <- function(x) x[, .SD, .SDcol = -key(x)]

I understand there are cases where it is useful, but data.table is still more focused on providing a broad and efficient framework for manipulating tabular data than on dedicated functions for something as basic as the above. If you strongly believe it should be included in master, you can try a PR 👍

@geneorama
Author

After six months I seem to be the only one who thinks this is a good idea, so I'll just stick with a custom function.

@mattdowle
Member

It doesn't seem like a bad idea to me. No objection to adding it. Not sure the best name. Would we need to select the key columns only sometimes as well - what would that function be called? key() already used so maybe keycolumns() and valuecolumns(), or keydata() and valuedata()? Hm.
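To make the naming question concrete, here is a rough sketch of what such a pair of helpers might look like. Both `keydata()` and `valuedata()` are hypothetical names taken from the suggestion above; neither exists in data.table, and the choice to raise an error in `keydata()` when the table has no key is an assumption made for this illustration.

```r
library(data.table)

# Hypothetical helpers: neither function exists in data.table.
keydata <- function(x) {
  # Assumption: asking for key columns of an unkeyed table is an error.
  stopifnot(is.data.table(x), haskey(x))
  x[, .SD, .SDcols = key(x)]
}

valuedata <- function(x) {
  stopifnot(is.data.table(x))
  # setdiff() keeps all columns when key(x) is NULL.
  x[, .SD, .SDcols = setdiff(names(x), key(x))]
}

demo <- data.table(mykey = c("b", "a", "c"), group = "train",
                   x1 = 1:3, y = c(0, 1, 0),
                   key = c("mykey", "group"))

names(keydata(demo))    # key columns only
names(valuedata(demo))  # everything except the key columns
```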

mattdowle reopened this Jun 9, 2015
@geneorama
Author

I was going to explain how I thought it was a bad idea... but I rechanged (?) my mind, and the example I worked out turned out to validate my original suggestion.

I think it could be confusing with .SDcols but it could be pretty useful otherwise.

This is an example of a pretty typical workflow for me:

EDIT: Also, I called it dekey... but I don't love that name either. You wouldn't want a devalue function, right? The zoo library uses coredata, which I don't like but can't beat.

library(data.table)
set.seed(1)
data_full <- data.table(mykey = letters,
                        group = c(rep("train", 10), rep("test", 16)),
                        x1 = rnorm(26), x2 = rnorm(26), x3 = rnorm(26), x4 = rnorm(26), 
                        x5 = rnorm(26), x6 = rnorm(26), x7 = rnorm(26), x8 = rnorm(26), 
                        y = sample(c(0, 1), 26, replace = TRUE),
                        key = c("mykey", "group"))
dekey <- function(x) x[, .SD, .SDcol = -key(x)]

## Regress on some different column subsets
## Perhaps create copies of the subsets for future plotting and analysis 
d1 <- data_full[ , list(x2,x4,x6,x8,y), keyby=list(mykey, group)]
d2 <- data_full[ , list(x1,x3,x5,y), keyby=list(mykey, group)]

glm1 <- glm(y ~ ., data = dekey(d1[group=="test"]), family = "binomial")
glm2 <- glm(y ~ ., data = dekey(d2[group=="test"]), family = "binomial")

## To create a data.table of predictions the keys have to be added back,
## and we're relying on the data being in the same order
pred1 <- data.table(data_full[ , list(mykey, group)],
                    yhat = predict(glm1, data_full),
                    key = c("mykey", "group"))
pred2 <- data.table(data_full[ , list(mykey, group)],
                    yhat = predict(glm2, data_full),
                    key = c("mykey", "group"))

## Merge in predictions as needed
data_full[pred1]
data_full[pred2]
## Merge in predictions as needed e.g. for plotting
library(ggplot2)
ggplot(data_full[pred1]) + aes(x2, yhat, colour = group) + geom_point(size=9)
ggplot(data_full[pred2]) + aes(x2, yhat, colour = group) + geom_point(size=9)
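On the point about relying on row order when the keys are added back: one possible alternative, sketched here under assumptions, is to write the predictions into the keyed table by reference, so each prediction stays on the row that produced it. The example uses a small made-up table and `lm()` instead of the `glm()` calls above to keep it short; `demo`, `fit`, and `yhat` are illustrative names.

```r
library(data.table)

set.seed(1)
demo <- data.table(mykey = letters, x1 = rnorm(26), x2 = rnorm(26),
                   key = "mykey")
demo[, y := 1 + 2 * x1 - x2 + rnorm(26, sd = 0.1)]

# Fit on the value columns only (everything that is not a key column).
fit <- lm(y ~ ., data = demo[, .SD, .SDcols = setdiff(names(demo), key(demo))])

# Add predictions by reference: no separate prediction table, no re-keying,
# and no dependence on two tables happening to share the same row order.
demo[, yhat := predict(fit, newdata = .SD)]

head(demo)
```

This trades the separate `pred1` / `pred2` tables for extra columns on the original table, which may or may not suit a given workflow.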

@raneameya

How about getDT? Would it be a good idea to have one function with the following arguments -

  • x: The data.table.
  • i: Rows to be subset, NULL by default indicating all rows.
  • j: Can be a character vector of column names or integer vector of column positions or one of "key" or "value".
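A rough sketch of how such a function might behave, purely to illustrate the three arguments listed above. `getDT()` is hypothetical and not part of data.table; the default `j = "value"` and the error for `j = "key"` on an unkeyed table are assumptions, since the proposal does not specify them.

```r
library(data.table)

# Hypothetical getDT(): a sketch of the proposal, not a data.table function.
getDT <- function(x, i = NULL, j = "value") {
  stopifnot(is.data.table(x))
  cols <- if (identical(j, "key")) {
    stopifnot(haskey(x))          # assumption: error when there is no key
    key(x)
  } else if (identical(j, "value")) {
    setdiff(names(x), key(x))     # all non-key columns
  } else if (is.numeric(j)) {
    names(x)[j]                   # column positions
  } else {
    j                             # column names
  }
  rows <- if (is.null(i)) seq_len(nrow(x)) else i
  x[rows, .SD, .SDcols = cols]
}

demo <- data.table(mykey = letters[1:4], x1 = 1:4, x2 = 5:8, key = "mykey")

getDT(demo)                                  # value columns, all rows
getDT(demo, j = "key")                       # key columns only
getDT(demo, i = 1:2, j = c("mykey", "x1"))   # first two rows, named columns
```

One design wrinkle with this interface: a table that has a real column named "key" or "value" would be ambiguous, which may be an argument for separate functions instead.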

jangorecki changed the title from "[Request] Return a data.table without the key" to "Return a data.table without the key" on Apr 6, 2020
@joshhwuu
Member

Quick follow-up on this issue, does anyone have suggestions on how to best close this issue?

@geneorama
Author

I opened the issue to see what people thought, and a decade later I think it's safe to close the polls.


7 participants