Delete rows by reference #635

Open
arunsrinivasan opened this Issue Jun 8, 2014 · 17 comments

arunsrinivasan commented Jun 8, 2014

Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link

Since deleting 1 column is DT[,colname:=NULL], and deleting rows is the same as deleting all columns for those rows, and we wish to use hierarchical indexes to find the rows to delete by reference, we just need a LHS to indicate "all" columns, leading to :

 DT[i,.:=NULL]   # delete rows by reference

 DT[,.:=NULL]    # error("must specify i to delete rows. To delete all rows from a table use DT[TRUE,.:=NULL], or, DT=DT[0].  This is deliberately a little harder, to avoid accidents such as 'delete from table', a common accident in SQL.")

We can also add an attribute "read only" or "protect" to a data.table, and if the user had protected the data.table in that way, .:= would not work on it.
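
For context, here is a minimal sketch of the asymmetry behind this request, on a made-up table. It only uses calls that exist in data.table today (`:=` and `address()`); the proposed `.:=` syntax is not used, since it is only a proposal here. The table contents and variable names are assumptions for illustration.

```r
library(data.table)

# Toy table (made up for illustration)
DT <- data.table(a = 1:6, b = 6:1, z = letters[1:6])
addr_before <- address(DT)

# Deleting a column happens in place: DT is still the same object.
DT[, z := NULL]
same_after_col_delete <- identical(address(DT), addr_before)

# "Deleting" rows currently means building a new table and rebinding DT.
DT <- DT[!(b >= 5)]
same_after_row_delete <- identical(address(DT), addr_before)
```

The column delete leaves the object where it was, while the row subset allocates a new table, which is the memory cost a by-reference row delete would aim to avoid.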

Co0olCat commented May 7, 2015

Second that.

Thank you.
Kind regards,
TY

mattdowle commented Oct 1, 2015

Deleting by reference on its own is not that hard. The benefit would mainly be memory efficiency rather than speed.

mattdowle commented Oct 29, 2015

How about adding both
delete(DT, b>=8 | a<=3)
and
DT[b>=8 | a<=8, .ROW:=NULL]
The advantage of the latter would be combining with other features of [] such as row numbers in i, join in i and roll. All benefiting from [i,j,by] optimization.
As per : http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-r-data-table/10791729?noredirect=1#comment54633906_10791729

mattdowle commented Oct 29, 2015

More advanced example :

DT[ b>=8, .SD[1, .ROW:=NULL], by=group]
# remove by reference the 1st observation in each group within a subset

Is .ROW the right name for this new symbol?

@mattdowle mattdowle changed the title from [R-Forge #2092] Delete rows by reference to Delete rows by reference Oct 29, 2015

eantonya commented Oct 29, 2015

Re right name: doesn't .SD already carry the right meaning for that (instead of introducing a new name a la .ROW)?

franknarf1 commented Oct 29, 2015

I think syntax for selecting rows to keep (which just deletes their complement) would be convenient.

delete(DT, b >= 8 | a <= 3) # or
keep(  DT, b <  8 & a >  3)

I don't know that there's a sensible way to extend this logic to work inside j. I'd just as well have Matt's second example only work via

badrows = DT[b >= 8, .I[1], by=g]$V1
delete(DT, badrows)

Just as new columns cannot be created by set (last I checked), it could be that row modifications cannot be done inside [.data.table.
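
For what it's worth, the two-step idea above (find the row numbers with .I, then drop them) already runs today if a copy of the kept rows is acceptable. A minimal sketch on a made-up table: the column names g and b follow the snippet above, the data is invented, and the filter is moved inside j (`.I[b >= 8][1L]`) so the sketch does not rely on how .I behaves when combined with i. Note that `DT[-badrows]` copies the kept rows; it is not a delete by reference.

```r
library(data.table)

# Toy data (made up for illustration)
DT <- data.table(g = c("x", "x", "x", "y", "y", "z"),
                 b = c(9, 7, 10, 8, 12, 1))

# Row number in DT of the first row with b >= 8 within each group.
# Groups with no such row yield NA, which is dropped afterwards.
badrows <- DT[, .I[b >= 8][1L], by = g]$V1
badrows <- badrows[!is.na(badrows)]

# Guard against an empty index: DT[-integer(0)] would select zero rows.
if (length(badrows)) DT <- DT[-badrows]
```

Here group z has no row with b >= 8, so nothing is removed from it.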

andrewrech commented Aug 23, 2016

If anyone needs a quick-and-dirty solution, as I did, here is a memory-efficient function, based on an SO answer by vc273, that selects the rows to keep one column at a time and then deletes each column by reference.

## ---- Deleting rows by reference using data.table*
## ---- *not exactly!

library(data.table)

# Example dt
DT = data.table(col1 = 1:1e6)
cols = paste0('col', 2:100)
for (col in cols){ DT[, (col) := 1:1e6] }  # (col) := replaces the deprecated with = F
keep.idxs = sample(1e6, 9e5, FALSE) # keep 90% of the rows

delete <- function(DT, keep.idxs){
  cols <- copy(names(DT))
  # start the subset from the first column, keeping its name
  DT_subset <- data.table(DT[[1]][keep.idxs])
  setnames(DT_subset, cols[1])
  for (col in cols){
    # copy the kept rows of one column, then drop that column from DT
    DT_subset[, (col) := DT[[col]][keep.idxs]]
    set(DT, NULL, col, NULL)
  }
  return(DT_subset)
}

str(delete(DT, keep.idxs))  # the subset is the return value
str(DT)                     # DT itself is left with no columns

vinhdizzo commented Aug 25, 2016

@andrewrech I can't get your code to work. I'm on the dev version of data.table, and when I run your code, I end up with an empty data.table:

> dim(d1)
[1] 0 0

Jarno-P commented Nov 18, 2016

To complement @andrewrech's answer, here is the code as a function, with an example of its usage.

delete <- function(DT, del.idxs) {           # pls note 'del.idxs' vs. 'keep.idxs'
  keep.idxs <- setdiff(DT[, .I], del.idxs)   # select row indexes to keep
  cols <- names(DT)
  DT.subset <- data.table(DT[[1]][keep.idxs])  # this is the subsetted table
  setnames(DT.subset, cols[1])
  for (col in cols[-1]) {                    # cols[-1] is also safe for a one-column table
    DT.subset[, (col) := DT[[col]][keep.idxs]]
    DT[, (col) := NULL]  # delete the column from DT by reference
  }
  return(DT.subset)
}

An example of its usage:

dat <- delete(dat, del.idxs)

Where "dat" is a data.table. Removing 14k rows from 1.4M rows takes 0.25 sec on my laptop.

> dim(dat)
[1] 1419393      25
> system.time(dat <- delete(dat,del.idxs))
   user  system elapsed 
   0.23    0.02    0.25 
> dim(dat)
[1] 1404715      25
> 

This is my very first GitHub post, btw.

skanskan commented Jan 9, 2017

Is it already implemented?

vikram-rawat commented Aug 16, 2017

Is it implemented now? It's kind of a necessary function.

MiloParigi commented Jan 3, 2018

What kind of work needs to be done in order to add this functionality to data.table? I would be glad to help, but I'm not totally sure where to start!

The delete function could be added using @Jarno-P's answer and later modified to be more efficient and to work with [] references, don't you think?

MichaelChirico commented Jan 8, 2018

I think the open question is the best API. data.table-like syntax would suggest the following should "work":

DT[rows_to_delete := NULL]

The functional approach of @Jarno-P would be a change from this, where row deletion would become functional & require DT <- f(DT) constructions. This may be best since := usages are truly by reference, whereas row deletions as exemplified thus far are only fast (compared to full copies), and not truly by reference.

Jarno-P commented Jan 9, 2018

Although I am anything but qualified to comment, should the syntax, from the user's perspective, be more like:

DT[ i , .SR := NULL ]

Where "i" is a DT-expression to select rows. .SR is similar to .SD, except that it is always defined within DT and includes references to all the rows selected by i. But such an approach may add overhead in expressions not intending to delete rows.

An alternative is to change the behavior of .SD so that it is also defined when no by-expression is used; when used without "by", .SD would refer to whole rows instead (with "by", .SD excludes the grouping columns).

matthiaskaeding commented Jan 18, 2018

An approach to bypass X <- f(X) might be to find the name of X via deparse + substitute and then use the assign function, e.g. like this (adapting the function of @Jarno-P):

del_rows <- function(X,delete) {
  
  keep <- -delete
  name_of_X <- deparse(substitute(X))
  X_names <- copy(names(X))
  X_new <- X[keep,X_names[1L],with=F]
  set(X,i=NULL,j=1L,value=NULL)
  
  for(j in seq_len(ncol(X))) {
    
    set(X_new,i=NULL,j=X_names[1L+j],value=X[[1L]][keep] )
    set(X,i=NULL,j=1L,value=NULL)
    
  }
  assign(name_of_X,value=X_new, envir = .GlobalEnv)
}

You would need to find out the environment of X for general cases.
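
One way the environment lookup might be handled is to default to the caller's frame via parent.frame(). This is only a sketch under assumptions: del_rows_rebind and its envir argument are hypothetical names, it only works when X is passed as a plain variable name (not something like lst$dt), and the subset still copies the kept rows rather than deleting by reference.

```r
library(data.table)

# Hypothetical helper (not part of data.table): drop rows and rebind the
# caller's variable, so no explicit X <- f(X) is needed at the call site.
del_rows_rebind <- function(X, delete, envir = parent.frame()) {
  name_of_X <- deparse(substitute(X))
  # Only plain variable names that live in the target environment are supported.
  stopifnot(exists(name_of_X, envir = envir, inherits = FALSE))
  keep <- setdiff(seq_len(nrow(X)), delete)
  X_new <- X[keep]                       # copies the kept rows
  assign(name_of_X, X_new, envir = envir)
  invisible(X_new)
}

# Works for a table that lives inside a function, not only in .GlobalEnv.
f <- function() {
  local_dt <- data.table(a = 1:5, b = letters[1:5])
  del_rows_rebind(local_dt, c(2L, 4L))
  local_dt
}
res <- f()
```

Other references to the original table (for example a second variable pointing at it) still see the old rows, so this is rebinding, not true deletion by reference.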
