cluster .id output restarts for each chromosome #171

robertamezquita · 2017-02-28T18:16:25Z

This may be more of a philosophical issue than a coding one, but should the .id from bed_cluster restart for each chromosome? To replicate output from other tools such as bedtools, I would think it makes more sense to number clusters sequentially across chromosomes rather than restarting.

jayhesselberth · 2017-02-28T18:26:32Z

So in the example below, you are suggesting the .ids would be 1,2,3,4 instead of 1,2,1,2? I guess I didn't appreciate that this is the bedtools default, which we have tried to adhere to in order to avoid surprises (like this one).

library(valr)
library(tidyverse)

x <- tibble::tribble(
  ~chrom, ~start, ~end,
  'chr1', 1, 100,
  'chr1', 150, 250,
  'chr2', 1, 100,
  'chr2', 150, 250
)

bed_cluster(x)
#> # A tibble: 4 × 4
#>   chrom start   end   .id
#>   <chr> <dbl> <dbl> <int>
#> 1  chr1     1   100     1
#> 2  chr1   150   250     2
#> 3  chr2     1   100     1
#> 4  chr2   150   250     2

I think we can just ditch the group_by operation at: https://github.com/rnabioco/valr/blob/master/R/bed_cluster.r#L48
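
To make the two numbering schemes concrete, here is a minimal base-R sketch. It is not valr's implementation: `cluster_starts` is a hypothetical helper, it assumes the rows are already sorted by chrom and start, and it assumes touching intervals belong to the same cluster. The only difference between the schemes is whether the running count of cluster starts is taken within each chromosome or across the whole input.

```r
# Sketch only: a base-R stand-in for the clustering step, not valr's code.
# `cluster_starts` is a hypothetical helper. It assumes rows are sorted by
# chrom and start, and treats touching intervals as one cluster.
cluster_starts <- function(chrom, start, end) {
  n <- length(start)
  is_new <- logical(n)
  running_end <- -Inf
  for (i in seq_len(n)) {
    # A cluster starts on the first row, on a new chromosome, or when the
    # interval begins past the furthest end seen in the current cluster.
    if (i == 1 || chrom[i] != chrom[i - 1] || start[i] > running_end) {
      is_new[i] <- TRUE
      running_end <- end[i]
    } else {
      running_end <- max(running_end, end[i])
    }
  }
  is_new
}

chrom <- c("chr1", "chr1", "chr2", "chr2")
start <- c(1, 150, 1, 150)
end   <- c(100, 250, 100, 250)
is_new <- cluster_starts(chrom, start, end)

# Counting cluster starts within each chromosome restarts the numbering.
per_chrom_ids <- ave(as.integer(is_new), chrom, FUN = cumsum)   # 1 2 1 2

# Counting across the whole input gives continuous ids.
continuous_ids <- cumsum(is_new)                                # 1 2 3 4
```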

robertamezquita · 2017-02-28T18:32:24Z

Exactly. Yes, I think that would make it a very easy fix!

robertamezquita · 2017-03-01T12:48:38Z

Awesome, also just wanted to say thanks for developing such a powerful native R toolset for working with genomic intervals!! It's a pain to use bedtools via the system interface, and it makes code portability difficult. Plus, it's not tidyverse friendly! Thank you for all the hard work that went into making this awesome!

robertamezquita · 2017-03-01T13:14:40Z

Hmm, it seems the latest version is returning the start column as the .id. I just pulled the newest master with #172 merged in, and here's the output from the example snippet in the bed_cluster function documentation:

x <- tibble::tribble(
                 ~chrom, ~start, ~end,
                 "chr1", 100,  200,
                 "chr1", 180,  250,
                 "chr1", 250,  500,
                 "chr1", 501,  1000
             )

bed_cluster(x)

 # A tibble: 4 × 4
   chrom start   end   .id
   <chr> <dbl> <dbl> <int>
 1  chr1   100   200   100
 2  chr1   180   250   100
 3  chr1   250   500   100
 4  chr1   501  1000   501

robertamezquita · 2017-03-01T13:29:03Z

The issue seems to be in valr:::merge_impl, as that's what's producing the faulty .id_merge column.

kriemo · 2017-03-01T13:30:20Z

I'm getting the correct output using 772ee26. Can you try reinstalling from the most recent commit on master?

Thanks

devtools::install_github("rnabioco/valr")
library(valr)
x <- tibble::tribble(
    ~chrom, ~start, ~end,
    "chr1", 100,  200,
    "chr1", 180,  250,
    "chr1", 250,  500,
    "chr1", 501,  1000
)

bed_cluster(x)
#> # A tibble: 4 × 4
#>  chrom start   end   .id
#>  <chr> <dbl> <dbl> <int>
#> 1  chr1   100   200     1
#> 2  chr1   180   250     1
#> 3  chr1   250   500     1
#> 4  chr1   501  1000     2

robertamezquita · 2017-03-01T13:35:37Z

Thanks! Must've been a mixup in my environment, sorry about that. It might be worth adding a "chr2" to the example just to show the continuous clustering .id numbers!

jayhesselberth · 2017-03-01T15:52:37Z

I think it is worth revisiting this. Having the numbering restart for each group in the input could be valuable, and is conceptually cleaner. The numbering is also easily combined with group_by for subsequent analyses. I realize this isn't bedtools behavior, but some of its behaviors can be improved.

If we revert, then you would simply have chrom and .id groups:

bed_cluster(x) %>%
  group_by(chrom, .id) %>%
  summarize(...)

This is also related to a rewrite of the Rcpp side where I am considering making use of group_by operations more explicit.

# implicitly grouped by chrom in current version
bed_intersect(x, y)

# explicit grouping by chrom
x <- group_by(x, chrom)
y <- group_by(y, chrom)

bed_intersect(x, y)
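
If the numbering restarts per chromosome, a cluster is identified by the (chrom, .id) pair rather than by .id alone. The base-R sketch below illustrates this on a hand-written data frame that stands in for bed_cluster() output (the values are made up for illustration, not produced by valr): counting by .id alone collapses clusters from different chromosomes, while counting by both columns does not.

```r
# Sketch only: hand-written stand-in for bed_cluster() output with
# per-chromosome numbering; not produced by valr.
res <- data.frame(
  chrom = c("chr1", "chr1", "chr1", "chr1", "chr2"),
  start = c(100, 180, 250, 501, 100),
  end   = c(200, 250, 500, 1000, 200),
  .id   = c(1L, 1L, 1L, 2L, 1L)
)

# Using .id alone treats chr1 cluster 1 and chr2 cluster 1 as the same group.
n_by_id <- length(unique(res$.id))                   # 2

# Using the (chrom, .id) pair keeps them apart.
n_by_pair <- nrow(unique(res[c("chrom", ".id")]))    # 3
```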

robertamezquita · 2017-03-01T15:59:55Z

True, my thought was that for exporting, communication, and summarisation it might be simpler to have continuous IDs through the whole genome, but programming-wise, yes, it would be a simple grouping by both chrom and .id. I'm not sure either solution is significantly better than the other; it's more of a philosophical difference in this case (I don't think comparing, say, the first cluster across chromosomes would prove useful, but I agree that conceptually it's cleaner). In this framework, I realize that grouping by chrom for operations is a pretty common structure, so it might make sense to go back to the original version with numbers restarting per chromosome.

kriemo · 2017-03-01T16:24:51Z

I'll put my vote in for using continuous ids rather than ids that repeat per group. I'm having a hard time visualizing a use case for comparing the same cluster ids across different chromosomes or other groupings. I'm also concerned that a user could easily get unexpected output if they forget which features the intervals were grouped by prior to clustering.

However, we could make these two behaviors configurable with an option such as (unique_ids = TRUE), which would allow for the user to decide.
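
As a rough illustration of how such an option could behave, here is a base-R sketch. `finalize_ids` is a hypothetical helper and `unique_ids` is only the argument name proposed above; neither is an existing valr function or argument. It assumes the ids passed in restart within each group.

```r
# Hypothetical post-processing step; `unique_ids` mirrors the option proposed
# above and is not an existing valr argument.
finalize_ids <- function(group, id, unique_ids = TRUE) {
  stopifnot(length(group) == length(id))
  if (!unique_ids) {
    # Keep the numbering that restarts within each group.
    return(id)
  }
  # Number each (group, id) pair in order of first appearance.
  key <- paste(group, id, sep = ":")
  match(key, unique(key))
}

chrom <- c("chr1", "chr1", "chr2", "chr2")
id    <- c(1L, 2L, 1L, 2L)

ids_unique   <- finalize_ids(chrom, id)                      # 1 2 3 4
ids_repeated <- finalize_ids(chrom, id, unique_ids = FALSE)  # 1 2 1 2
```

Defaulting to `unique_ids = TRUE` would match the continuous numbering, with the per-group numbering available on request.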

jayhesselberth assigned jayhesselberth and kriemo Feb 28, 2017

kriemo added a commit to kriemo/valr that referenced this issue Mar 1, 2017

track ids independent of groups in bed_cluster rnabioco#171

5185663

jayhesselberth pushed a commit that referenced this issue Mar 1, 2017

track ids independent of groups in bed_cluster

f5eb89a

* track ids independent of groups in bed_cluster #171 * fixes #171 * sort input by default to avoid additional explict sort

jayhesselberth mentioned this issue Mar 1, 2017

track ids independent of groups in bed_cluster #173

Merged

jayhesselberth added the in progress label Mar 1, 2017

jayhesselberth closed this as completed in #173 Mar 1, 2017

jayhesselberth added a commit that referenced this issue Mar 1, 2017

track ids independent of groups in bed_cluster (#173)

772ee26

* track ids independent of groups in bed_cluster #171 * fixes #171 * sort input by default to avoid additional explict sort

jayhesselberth removed the in progress label Mar 1, 2017

jayhesselberth added a commit that referenced this issue Mar 1, 2017

add contiguous .id to bed_merge example (#171)

d9545f4

- update news - roxygen bump - re-pkg

jayhesselberth mentioned this issue Mar 1, 2017

glyphs representing bed_merge output contain overlapping intervals #175

Closed
