Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster .id output restarts for each chromosome #171

Closed
robertamezquita opened this issue Feb 28, 2017 · 10 comments · Fixed by #173
Closed

cluster .id output restarts for each chromosome #171

robertamezquita opened this issue Feb 28, 2017 · 10 comments · Fixed by #173
Assignees

Comments

@robertamezquita
Copy link

This may be a philosophical issue more than coding, but for bed_cluster, should the .id restart for each chromosome? Ideally, I would think to replicate output from other tools such as bedtools it would make more sense to continue sequentially through the chromosomes rather than restarting.

@jayhesselberth
Copy link
Member

So in the example below, you are suggesting the .ids would 1,2,3,4 instead of 1,2,1,2? I guess I didn't appreciate that is the bedtools default, which we have tried to adhere to, to avoid surprises (like this one).

library(valr)
library(tidyverse)

x <- tibble::tribble(
  ~chrom, ~start, ~end,
  'chr1', 1, 100,
  'chr1', 150, 250,
  'chr2', 1, 100,
  'chr2', 150, 250
)

bed_cluster(x)
#> # A tibble: 4 × 4
#>   chrom start   end   .id
#>   <chr> <dbl> <dbl> <int>
#> 1  chr1     1   100     1
#> 2  chr1   150   250     2
#> 3  chr2     1   100     1
#> 4  chr2   150   250     2

I think we can just ditch the group_by operation at: https://github.com/rnabioco/valr/blob/master/R/bed_cluster.r#L48

@robertamezquita
Copy link
Author

Exactly. Yes, I think that would make it a very easy fix!

kriemo added a commit to kriemo/valr that referenced this issue Mar 1, 2017
jayhesselberth pushed a commit that referenced this issue Mar 1, 2017
* track ids independent of groups in bed_cluster #171

* fixes #171 

* sort input by default to avoid additional explict sort
jayhesselberth added a commit that referenced this issue Mar 1, 2017
* track ids independent of groups in bed_cluster #171

* fixes #171 

* sort input by default to avoid additional explict sort
@robertamezquita
Copy link
Author

Awesome, also just wanted to say thanks for developing such a powerful native R toolset for working with genomic intervals!! Its a pain to use bedtools via the system interface, and makes code portability difficult. Plus, its not tidyverse friendly! Thank you for all the hard work into making this awesome!

@robertamezquita
Copy link
Author

robertamezquita commented Mar 1, 2017

Hmm seems like the latest version is returning the start column as the .id. Just pulled the newest master with #172 merged in and here's the output using the example code snippet in the function of bed_cluster

x <- tibble::tribble(
                 ~chrom, ~start, ~end,
                 "chr1", 100,  200,
                 "chr1", 180,  250,
                 "chr1", 250,  500,
                 "chr1", 501,  1000
             )

bed_cluster(x)

 # A tibble: 4 × 4                                                                                                                                                                                                 
   chrom start   end   .id
   <chr> <dbl> <dbl> <int>
 1  chr1   100   200   100
 2  chr1   180   250   100
 3  chr1   250   500   100
 4  chr1   501  1000   501

@robertamezquita
Copy link
Author

The issue seems to be in valr:::merge_impl as thats whats producing the faulty .id_merge column

@kriemo
Copy link
Member

kriemo commented Mar 1, 2017

I'm getting the correct output using 772ee26. Can you try reinstalling from the most recent commit on master?

Thanks

devtools::install_github("rnabioco/valr")
library(valr)
x <- tibble::tribble(
    ~chrom, ~start, ~end,
    "chr1", 100,  200,
    "chr1", 180,  250,
    "chr1", 250,  500,
    "chr1", 501,  1000
)

bed_cluster(x)
#> # A tibble: 4 × 4
#>  chrom start   end   .id
#>  <chr> <dbl> <dbl> <int>
#> 1  chr1   100   200     1
#> 2  chr1   180   250     1
#> 3  chr1   250   500     1
#> 4  chr1   501  1000     2

@robertamezquita
Copy link
Author

robertamezquita commented Mar 1, 2017 via email

@jayhesselberth
Copy link
Member

I think it is worth revisiting this. Having the numbering restart for each group is the input could be valuable, and is conceptually cleaner. And the numbering is easily combined with group_by for subsequent analyses. I realize this isn't bedtools behavior, but some of its behaviors can be improved.

If we revert, then you would simply have chrom and .id groups:

bed_cluster(x) %>%
  group_by(chrom, .id) %>%
  summarize(...)

This is also related to a rewrite of the Rcpp side where I am considering making use of group_by operations more explicit.

# implicitly grouped by chrom in current version
bed_intersect(x, y)

# explicit grouping by chrom
x <- group_by(x, chrom)
y <- group_by(y, chrom)

bed_intersect(x, y)

@robertamezquita
Copy link
Author

robertamezquita commented Mar 1, 2017 via email

@kriemo
Copy link
Member

kriemo commented Mar 1, 2017

I'll put my vote in for using continuous ids rather than repeated per group ids. I'm having a hard time visualizing a use-case for comparing the same cluster ids across different chromosomes or other groupings. I'm also concerned that a user could easily get unexpected output could if they forget what features the ivls were grouped by prior to clustering.

However, we could make these two behaviors configurable with an option such as (unique_ids = TRUE), which would allow for the user to decide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants