Option to add colnames to new columns #58

adomingues · 2018-04-03T11:32:46Z

First of all, thank you so much for this package! It has been part of my routine analysis for some time now. I would just like to suggest a convenience option to skip column renaming after splitting. Example:

to_split <- structure(list(Sample = c("N2_wt_rep1_untreated", "N2_wt_rep1_untreated", 
"N2_wt_rep1_untreated", "N2_wt_rep2_untreated", "N2_wt_rep2_untreated", 
"N2_wt_rep2_untreated"), Reads = c(470987L, 270891L, 56114L, 
513902L, 310722L, 67263L)), .Names = c("Sample", "Reads"), class = "data.frame", row.names = c(NA, 
-6L))
split <- cSplit(to_split, "Sample", sep="_")
split
#     Reads Sample_1 Sample_2 Sample_3  Sample_4
# 1: 470987       N2       wt     rep1 untreated
# 2: 270891       N2       wt     rep1 untreated
# 3:  56114       N2       wt     rep1 untreated
# 4: 513902       N2       wt     rep2 untreated
# 5: 310722       N2       wt     rep2 untreated
# 6:  67263       N2       wt     rep2 untreated

The new col names are not very informative, so I usually rename them in an extra step:

setnames(split,
   c("Sample_1", "Sample_2", "Sample_3", "Sample_4"),
   c("Background", "Allele", "Replicate", "Treatment")
)

This is fine, but I wonder if it would be possible to skip that extra step with cSplit(to_split, "Sample", sep="_", new_names=c("Background", "Allele", "Replicate", "Treatment"))

Cheers.


mrdwab · 2018-04-04T02:47:07Z

Thanks @adomingues for the comment. I've thought about this in the past. It shouldn't be too difficult to implement, so I'll look into it again.

Here are a couple of reasons I didn't implement it the first time around:

The cSplit function is generalized in the sense that I should be able to split a column not knowing how many columns would be in the result.
The cSplit function is vectorized over columns, so a simple new_names = c(...) wouldn't work; it would have to be something like list(Sample = c("Background", "Allele", "Replicate", "Treatment"))
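
To make the two points above concrete, here is a minimal base R sketch. Both helpers (count_pieces and check_new_names) are hypothetical names used only for illustration and are not part of splitstackshape. It shows that the number of resulting columns depends on the data, and how a named list of new names could be checked against it:

```r
# Hypothetical helper (not part of splitstackshape): for each column, find
# the largest number of pieces that splitting on `sep` would produce.
count_pieces <- function(data, cols, sep = ",") {
  vapply(cols, function(col) {
    max(lengths(strsplit(as.character(data[[col]]), sep, fixed = TRUE)))
  }, integer(1))
}

# Hypothetical check: `new_names` must be a named list with one entry per
# split column, each holding as many names as that column produces.
check_new_names <- function(data, new_names, sep = ",") {
  stopifnot(is.list(new_names), !is.null(names(new_names)))
  expected <- count_pieces(data, names(new_names), sep)
  supplied <- lengths(new_names)
  bad <- names(new_names)[supplied != expected]
  if (length(bad) > 0) {
    stop("wrong number of new names for: ", paste(bad, collapse = ", "))
  }
  invisible(TRUE)
}

df <- data.frame(y = c("a", "d,e", "g,h"), z = c("1", "2,3,4", "6"))

# The column count is only known after looking at the data
count_pieces(df, c("y", "z"))
# y z
# 2 3

# A named list keeps the names for each split column separate
check_new_names(df, list(y = c("A", "B"), z = c("AA", "BB", "CC")))
```

A flat character vector could not express this unambiguously once more than one column is split, which is why the named list form is suggested.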

Any thoughts on those?

adomingues · 2018-04-04T06:57:59Z

Thanks for considering this, @mrdwab. I was thinking about the implementation after posting, and my very naïve thought was to operate on the colnames after splitting. For instance, grepping the colnames and replacing only those:

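cSplit2 <- function(indt, splitCols, newNames, ...){
   # Split using the arguments passed in rather than the hard-coded example data
   split <- cSplit(indt, splitCols, ...)
   # Positions of the columns created by the split (e.g. Sample_1, Sample_2)
   newcols <- grep(paste0("^(", paste(splitCols, collapse="|"), ")_"), colnames(split))
   colnames(split)[newcols] <- newNames
   return(split)
}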

cSplit2(to_split, splitCols = "Sample", sep="_", newNames = c("Background", "Allele", "Replicate", "Treatment"))

This is of course the opposite of what you suggested :) but I wonder if it would be a good starting point.

mrdwab · 2018-04-04T07:27:28Z

@adomingues, here's a proof-of-concept (POC) renamer function that I can probably drop in at the last stages of the existing cSplit function. Here, I'm just demonstrating it as an external function:

library(splitstackshape)
library(data.table)
df <- data.frame(x = 1:3, y = c("a", "d,e", "g,h"), z = c("1", "2,3,4", "6"))

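renamer <- function(data, replacement) {
  if (!is.list(replacement)) stop("replacement should be a named list")
  for (i in seq_along(replacement)) {
    # Columns created from this source column, e.g. "y_1", "y_2" for "y"
    old <- names(data)[startsWith(names(data), paste0(names(replacement)[i], "_"))]
    # setnames() renames by reference; old and new must have the same length
    setnames(data, old = old, new = replacement[[i]])
  }
  data[]
}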

cSplit(df, c("y", "z"))
#    x y_1  y_2 z_1 z_2 z_3
# 1: 1   a <NA>   1  NA  NA
# 2: 2   d    e   2   3   4
# 3: 3   g    h   6  NA  NA

renamer(cSplit(df, c("y", "z")), 
        list(y = c("A", "B"), z = c("AA", "BB", "CC")))
#    x A    B AA BB CC
# 1: 1 a <NA>  1 NA NA
# 2: 2 d    e  2  3  4
# 3: 3 g    h  6 NA NA

So, a possible final implementation might look like:

cSplit(df, c("y", "z"), sep = ",", new_names = list(y = c("A", "B"), z = c("AA", "BB", "CC")))

Alternatively, the entire API can be revisited such that, depending on the input, the function behaves differently:

If a simple character string of column names is provided, use the current approach.
If a list is provided in the splitCols argument, new names can be specified (eg: cSplit(df, splitCols = list(y = c("A", "B"), z = c("AA", "BB", "CC")), sep = ","))
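
As a rough illustration of how those two input forms could be told apart, here is a minimal base R sketch. The helper name normalize_split_cols and its return shape are assumptions made for this example, not existing or planned package API:

```r
# Hypothetical helper: normalise the two accepted forms of `splitCols`
# into the columns to split plus optional new names.
normalize_split_cols <- function(splitCols) {
  if (is.character(splitCols)) {
    # Plain character vector: current behaviour, default names are kept
    list(cols = splitCols, new_names = NULL)
  } else if (is.list(splitCols) && !is.null(names(splitCols)) &&
             all(nzchar(names(splitCols)))) {
    # Named list: the names are the columns to split, the values the new names
    list(cols = names(splitCols), new_names = splitCols)
  } else {
    stop("splitCols must be a character vector or a fully named list")
  }
}

spec <- normalize_split_cols(list(y = c("A", "B"), z = c("AA", "BB", "CC")))
spec$cols
# [1] "y" "z"
```

The splitting itself would then run on spec$cols as it does today, with a renaming step applied only when spec$new_names is not NULL.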

Let me think about it some more, but I'm open to other ideas as well as I'm currently planning a V2 release of the package later this year.

adomingues · 2018-04-04T07:45:19Z

If a list is provided in the splitCols argument, new names can be specified (eg: cSplit(df, splitCols = list(y = c("A", "B"), z = c("AA", "BB", "CC")), sep = ","))

This pretty much solves it, at least for me. Looking forward to V2.

mrdwab self-assigned this Apr 4, 2018

mrdwab added enhancement question labels Apr 4, 2018

mrdwab added this to In progress in Splitstackshape V2.0 Apr 5, 2018
