cluster .id output restarts for each chromosome #171
Comments
So in the example below, you are suggesting the
library(valr)
library(tidyverse)
x <- tibble::tribble(
~chrom, ~start, ~end,
'chr1', 1, 100,
'chr1', 150, 250,
'chr2', 1, 100,
'chr2', 150, 250
)
bed_cluster(x)
#> # A tibble: 4 × 4
#> chrom start end .id
#> <chr> <dbl> <dbl> <int>
#> 1 chr1 1 100 1
#> 2 chr1 150 250 2
#> 3 chr2 1 100 1
#> 4 chr2 150 250 2
I think we can just ditch the |
Exactly. Yes, I think that would make it a very easy fix! |
Awesome, also just wanted to say thanks for developing such a powerful native R toolset for working with genomic intervals!! It's a pain to use bedtools via the |
Hmm seems like the latest version is returning the
|
The issue seems to be in |
I'm getting the correct output using 772ee26. Can you try reinstalling from the most recent commit on master? Thanks
devtools::install_github("rnabioco/valr")
library(valr)
x <- tibble::tribble(
~chrom, ~start, ~end,
"chr1", 100, 200,
"chr1", 180, 250,
"chr1", 250, 500,
"chr1", 501, 1000
)
bed_cluster(x)
#> # A tibble: 4 × 4
#> chrom start end .id
#> <chr> <dbl> <dbl> <int>
#> 1 chr1 100 200 1
#> 2 chr1 180 250 1
#> 3 chr1 250 500 1
#> 4 chr1 501 1000 2 |
Thanks! Must've been a mixup in my environment...sorry about that. Might
add to the example a "chr2" just to show the continuous clustering .id
numbers!
--
Robert A. Amezquita
HHMI Gilliam Fellow | Blavatnik Associate
PhD Candidate | Kleinstein & Kaech Labs
Yale University | Department of Immunobiology
300 George St, Suite 505
New Haven, CT 06511-6663
Mobile: 858-245-3350
Email: robert.amezquita@yale.edu
|
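The suggestion above (extending the example with a "chr2" so the continuous `.id` numbering is visible) can be illustrated without depending on any particular valr version. The following is a minimal sketch, not valr's implementation: it starts from a hand-typed copy of the per-chromosome output shown earlier in the thread and renumbers it with dplyr. The column name `.id_global` is a hypothetical name introduced only for this illustration.

```r
library(dplyr)

# Hand-typed copy of the earlier bed_cluster() output, where .id restarts
# on each chromosome (this is example data, not a fresh call to valr).
clustered <- tibble::tibble(
  chrom = c("chr1", "chr1", "chr2", "chr2"),
  start = c(1, 150, 1, 150),
  end   = c(100, 250, 100, 250),
  .id   = c(1L, 2L, 1L, 2L)
)

# Number each distinct (chrom, .id) pair in order of first appearance.
# .id_global is a hypothetical column name used for illustration.
continuous <- clustered %>%
  mutate(
    key = paste(chrom, .id),
    .id_global = match(key, unique(key))
  ) %>%
  select(-key)

continuous$.id_global
```

With this input the per-chromosome ids 1, 2, 1, 2 become 1, 2, 3, 4, which is the continuous numbering the thread describes.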
- update news - roxygen bump - re-pkg
I think it is worth revisiting this. Having the numbering restart for each group in the input could be valuable, and is conceptually cleaner. And the numbering is easily combined with group_by for subsequent analyses. I realize this isn't bedtools behavior, but some of its behaviors can be improved.
If we revert, then you would simply have chrom and .id groups:
bed_cluster(x) %>%
  group_by(chrom, .id) %>%
  summarize(...)
This is also related to a rewrite of the Rcpp side where I am considering making use of group_by operations more explicit.
# implicitly grouped by chrom in current version
bed_intersect(x, y)
# explicit grouping by chrom
x <- group_by(x, chrom)
y <- group_by(y, chrom)
bed_intersect(x, y) |
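To make the `group_by(chrom, .id) %>% summarize(...)` pattern concrete, here is a minimal sketch under the assumption that `.id` restarts on each chromosome. The input is hand-typed example data rather than real `bed_cluster()` output, and the summary columns (`n_ivls` in particular) are hypothetical names chosen for this illustration.

```r
library(dplyr)

# Example intervals with per-chromosome cluster ids (hand-typed, not
# produced by valr): chr1 has two clusters, chr2 has two clusters.
clustered <- tibble::tibble(
  chrom = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  start = c(100, 180, 501, 1, 150),
  end   = c(200, 250, 1000, 100, 250),
  .id   = c(1L, 1L, 2L, 1L, 2L)
)

# Collapse each cluster to its span. Grouping by both chrom and .id keeps
# cluster 1 on chr1 separate from cluster 1 on chr2.
cluster_spans <- clustered %>%
  group_by(chrom, .id) %>%
  summarize(
    start  = min(start),
    end    = max(end),
    n_ivls = n(),
    .groups = "drop"
  )

cluster_spans
```

The result has one row per (chrom, .id) pair, so four rows here; the first cluster on chr1 spans 100 to 250 and contains two intervals.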
True, my thought was that for exporting/communication and summarisation it
might be simpler to have continuous IDs through the whole genome, but
programming-wise, yes, it would be a simple grouping by both chrom and .id.
I'm not sure either solution is significantly better than the other; it's
more of a philosophical difference in this case (I don't think comparing
the first cluster across chromosomes, for instance, would prove useful,
but I agree that conceptually it's cleaner). In this framework, I realize
that grouping by chrom for operations is a pretty common structure, so it
might make sense then to go back to the original version with numbers
restarting per chromosome.
…On Wed, Mar 1, 2017 at 10:52 AM, Jay Hesselberth ***@***.***> wrote:
I think it is worth revisiting this. Having the numbering restart for each
group in the input could be valuable, and is conceptually cleaner. And the
numbering is easily combined with group_by for subsequent analyses. I
realize this isn't bedtools behavior, but some of its behaviors can be
improved.
If we revert, then you would simply have chrom and .id groups:
bed_cluster(x) %>%
group_by(chrom, .id) %>%
summarize(...)
This is also related to a rewrite of the Rcpp side where I am considering
making use of group_by operations more explicit.
# implicitly grouped by chrom in current version
bed_intersect(x, y)
# explicit grouping by chrom
x <- group_by(x, chrom)
y <- group_by(y, chrom)
bed_intersect(x, y)
|
I'll put my vote in for using continuous ids rather than repeated per-group ids. I'm having a hard time visualizing a use case for comparing the same cluster ids across different chromosomes or other groupings. I'm also concerned that a user could easily get unexpected output if they forget which features the intervals were grouped by prior to clustering. However, we could make these two behaviors configurable with an option such as (unique_ids = TRUE), which would allow the user to decide. |
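The concern about unexpected output can be shown with a small sketch. It assumes per-chromosome ids (hand-typed example data, not a call to valr) and compares grouping by `.id` alone against grouping by `chrom` and `.id`. The variable names are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hand-typed example where .id restarts on each chromosome.
clustered <- tibble::tibble(
  chrom = c("chr1", "chr1", "chr2", "chr2"),
  start = c(1, 150, 1, 150),
  end   = c(100, 250, 100, 250),
  .id   = c(1L, 2L, 1L, 2L)
)

# Forgetting chrom: clusters from different chromosomes are merged
# because they share the same .id value.
by_id_only <- clustered %>% count(.id)

# Including chrom: every cluster stays distinct.
by_chrom_and_id <- clustered %>% count(chrom, .id)

nrow(by_id_only)       # 2 groups, each mixing chr1 and chr2
nrow(by_chrom_and_id)  # 4 groups, one per actual cluster
```

With continuous ids the two groupings would give the same number of groups, which is the argument made above for unique ids as the default.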
This may be a philosophical issue more than coding, but for bed_cluster, should the .id restart for each chromosome? Ideally, I would think that to replicate output from other tools such as bedtools, it would make more sense to continue sequentially through the chromosomes rather than restarting.