Option to add colnames to new columns #58

Open
adomingues opened this issue Apr 3, 2018 · 4 comments

Comments

@adomingues

First of all, thank you so much for this package! It has been part of my routine analysis for some time now. I would just like to suggest a convenience option to skip column renaming after splitting. Example:

to_split <- structure(list(Sample = c("N2_wt_rep1_untreated", "N2_wt_rep1_untreated", 
"N2_wt_rep1_untreated", "N2_wt_rep2_untreated", "N2_wt_rep2_untreated", 
"N2_wt_rep2_untreated"), Reads = c(470987L, 270891L, 56114L, 
513902L, 310722L, 67263L)), .Names = c("Sample", "Reads"), class = "data.frame", row.names = c(NA, 
-6L))
split <- cSplit(to_split, "Sample", sep="_")
split
#     Reads Sample_1 Sample_2 Sample_3  Sample_4
# 1: 470987       N2       wt     rep1 untreated
# 2: 270891       N2       wt     rep1 untreated
# 3:  56114       N2       wt     rep1 untreated
# 4: 513902       N2       wt     rep2 untreated
# 5: 310722       N2       wt     rep2 untreated
# 6:  67263       N2       wt     rep2 untreated

The new col names are not very informative, so I usually rename them in an extra step:

setnames(split,
   c("Sample_1", "Sample_2", "Sample_3", "Sample_4"),
   c("Background", "Allele", "Replicate", "Treatment")
)

This is fine, but I wonder if it would be possible to skip that extra step with cSplit(to_split, "Sample", sep="_", new_names=c("Background", "Allele", "Replicate", "Treatment"))

Cheers.

@mrdwab
Owner

mrdwab commented Apr 4, 2018

Thanks @adomingues for the comment. I've thought about this in the past. It shouldn't be too difficult to implement, so I'll look into it again.

Here are a couple of reasons I didn't implement it the first time around:

  • The cSplit function is generalized in the sense that I should be able to split a column not knowing how many columns would be in the result.
  • The cSplit function is vectorized, so a simple new_names = c(...) wouldn't work; it would have to be something like list(Sample = c("Background", "Allele", "Replicate", "Treatment")).

Any thoughts on those?
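
To make the two points above concrete, here is a minimal base R sketch. It is not part of splitstackshape: count_pieces and check_new_names are hypothetical helper names, and the sketch assumes a fixed-string separator. It shows that the number of resulting columns is only known after inspecting the data, and that new names therefore have to be keyed by source column.

```r
# Hypothetical helper (not in splitstackshape): for each column, find the
# largest number of pieces any value splits into.
count_pieces <- function(data, split_cols, sep = ",") {
  vapply(split_cols, function(col) {
    pieces <- strsplit(as.character(data[[col]]), sep, fixed = TRUE)
    max(lengths(pieces))
  }, integer(1))
}

# Hypothetical check: new_names must be a named list that supplies exactly
# one name per piece for every column it mentions.
check_new_names <- function(data, new_names, sep = ",") {
  stopifnot(is.list(new_names), !is.null(names(new_names)))
  expected <- count_pieces(data, names(new_names), sep)
  supplied <- lengths(new_names)
  ok <- supplied == expected
  if (!all(ok)) {
    stop("wrong number of new names for: ",
         paste(names(new_names)[!ok], collapse = ", "))
  }
  invisible(TRUE)
}

df <- data.frame(x = 1:3, y = c("a", "d,e", "g,h"), z = c("1", "2,3,4", "6"))

# y splits into at most 2 pieces, z into at most 3
count_pieces(df, c("y", "z"))

# Passes silently; list(y = "A") would stop with an error instead
check_new_names(df, list(y = c("A", "B"), z = c("AA", "BB", "CC")))
```

A check like this could run before any renaming, so a mismatch is reported by column name rather than surfacing as a length error later.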

@adomingues
Author

adomingues commented Apr 4, 2018

Thanks for considering this @mrdwab. I was thinking about the implementation after posting, and my very naïve thought was to operate on the colnames after splitting. For instance, grepping the colnames and replacing only those:

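cSplit2 <- function(indt, splitCols, newNames, ...){
   # Split using the arguments passed in; extras such as sep are forwarded
   split <- cSplit(indt, splitCols, ...)
   # Match only the generated columns, e.g. Sample_1, Sample_2, ...
   newcols <- grep(paste0("^(", paste(splitCols, collapse="|"), ")_[0-9]+$"), colnames(split))
   colnames(split)[newcols] <- newNames
   return(split)
}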

cSplit2(to_split, splitCols = "Sample", sep="_", newNames = c("Background", "Allele", "Replicate", "Treatment"))

This is of course the opposite of what you suggested :) but I wonder if it would be a good starting point.

@mrdwab
Owner

mrdwab commented Apr 4, 2018

@adomingues, here's a POC renamer function that I can probably drop in at the last stages of the existing cSplit function. Here, I'm just demonstrating it as an external function:

library(splitstackshape)
library(data.table)
df <- data.frame(x = 1:3, y = c("a", "d,e", "g,h"), z = c("1", "2,3,4", "6"))

renamer <- function(data, replacement) {
  if (!is.list(replacement)) stop("replacement should be a named list")
  for (i in seq_along(replacement)) {
    old <- names(data)[startsWith(names(data), names(replacement)[i])]
    setnames(data, old = old, new = replacement[[i]])
  }
  data[]
}

cSplit(df, c("y", "z"))
#    x y_1  y_2 z_1 z_2 z_3
# 1: 1   a <NA>   1  NA  NA
# 2: 2   d    e   2   3   4
# 3: 3   g    h   6  NA  NA

renamer(cSplit(df, c("y", "z")), 
        list(y = c("A", "B"), z = c("AA", "BB", "CC")))
#    x A    B AA BB CC
# 1: 1 a <NA>  1 NA NA
# 2: 2 d    e  2  3  4
# 3: 3 g    h  6 NA NA
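
One thing to watch with prefix-based matching such as startsWith(): it also picks up unrelated columns that happen to share the prefix. The base R sketch below uses made-up column names and a hypothetical helper, generated_cols, to show one way of restricting the match to names of the form prefix_number. It is only an illustration, not what cSplit does internally.

```r
# Hypothetical column names: y_1 and y_2 come from a split, "year" does not.
cols <- c("x", "year", "y_1", "y_2", "z_1")

# Prefix matching returns "year" along with the split columns.
cols[startsWith(cols, "y")]

# Hypothetical helper: keep only names of the form <prefix>_<digits>.
# \\Q...\\E quotes the prefix literally (PCRE syntax, hence perl = TRUE).
generated_cols <- function(cols, prefix) {
  grep(paste0("^\\Q", prefix, "\\E_[0-9]+$"), cols, value = TRUE, perl = TRUE)
}

# Returns only "y_1" and "y_2"
generated_cols(cols, "y")
```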

So, a possible final implementation might look like:

cSplit(df, c("y", "z"), sep = ",", new_names = list(y = c("A", "B"), z = c("AA", "BB", "CC")))

Alternatively, the entire API can be revisited such that, depending on the input, the function behaves differently:

  • If a simple character string of column names is provided, use the current approach.
  • If a list is provided in the splitCols argument, new names can be specified (eg: cSplit(df, splitCols = list(y = c("A", "B"), z = c("AA", "BB", "CC")), sep = ","))

Let me think about it some more, but I'm open to other ideas as well as I'm currently planning a V2 release of the package later this year.
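
As a rough illustration of the input-dependent behaviour described above, the sketch below normalizes a splitCols value into the columns to split plus optional new names. normalize_split_cols is a hypothetical name and the return structure is an assumption made for this example; nothing here reflects an actual V2 API.

```r
# Hypothetical normalizer: accepts either a character vector (current usage)
# or a fully named list (proposed usage) and returns both pieces of
# information in one consistent shape.
normalize_split_cols <- function(splitCols) {
  if (is.character(splitCols)) {
    # Current approach: no new names requested
    return(list(cols = splitCols, new_names = NULL))
  }
  if (is.list(splitCols) && !is.null(names(splitCols)) &&
      all(nzchar(names(splitCols)))) {
    # Proposed approach: list names are the columns, values are the new names
    return(list(cols = names(splitCols), new_names = splitCols))
  }
  stop("splitCols should be a character vector or a fully named list")
}

normalize_split_cols(c("y", "z"))
normalize_split_cols(list(y = c("A", "B"), z = c("AA", "BB", "CC")))
```

Handling both input forms up front would let the rest of the function work with a single representation, with renaming applied only when new_names is not NULL.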

@adomingues
Author

If a list is provided in the splitCols argument, new names can be specified (eg: cSplit(df, splitCols = list(y = c("A", "B"), z = c("AA", "BB", "CC")), sep = ","))

This pretty much solves it, at least for me. Looking forward to V2.

@mrdwab mrdwab added this to In progress in Splitstackshape V2.0 Apr 5, 2018