additional genome builds #241

Closed
igordot opened this issue Jun 22, 2020 · 16 comments
Labels
enhancement New feature or request

Comments

@igordot

igordot commented Jun 22, 2020

Is it possible to use genome builds other than hg19 or hg38? Specifically, I am hoping to use mouse mm10.

If you modify get_genome_annotation(), would that be sufficient? That function retrieves centromeres, chromsize, and cytobands, but I couldn't find how exactly those were generated.

@ShixiangWang
Owner

I don't study the mouse genome, but I will check. get_genome_annotation() is a very important part, but all of the code needs to be checked before it can be used with the mouse genome.

You can check the data source with:

> ?chromsize.hg38
> ?centromeres.hg38
> ?cytobands.hg38

Currently, cytobands are not required, but chromsize and centromeres are. Could you provide the mouse data in the same format, so that I can support this as soon as possible?

ShixiangWang added the enhancement (New feature or request) label Jun 23, 2020
@igordot
Author

igordot commented Jun 23, 2020

Sure. I can provide some data. Did you mean the annotation data or some sample copy number data?

@ShixiangWang
Owner

ShixiangWang commented Jun 23, 2020

The annotation data: chromsize.mm10, centromeres.mm10, etc. You can find them at UCSC or another reference database.

@igordot
Author

igordot commented Jun 23, 2020

The chromsizes and cytobands are available here:

The problem is that centromeres do not exist in the mouse. UCSC lists them as 110000-3000000 for every chromosome. You can see that in:

@ShixiangWang
Owner

@igordot Thanks, I will clean them and add them to sigminer.

One question: I also need to add transcript locations for the mouse genome, but at https://www.gencodegenes.org/mouse/ I only see Release M25. Is it based on mm10?

@igordot
Author

igordot commented Jun 23, 2020

I have been using the GENCODE gene releases with mm10. GRCm38 is the equivalent of mm10.

Thank you so much for such a quick response.

@ShixiangWang
Owner

ShixiangWang commented Jun 23, 2020

@igordot Thanks. It will take about a day to finish this new feature. I will let you know when it is done.

@ShixiangWang
Owner

@igordot I cannot find the centromere for the Y chromosome in the UCSC data. Is it also okay to set it to 110000-3000000?

Based on the plot from https://www.biorxiv.org/content/10.1101/096297v3.full, the centromere of the Y chromosome seems to sit at a larger coordinate than the centromere of the X chromosome.

image

ShixiangWang added a commit that referenced this issue Jun 23, 2020
@ShixiangWang
Owner

@igordot Please install the package from GitHub and see how it works for the mm10 genome.

remotes::install_github("ShixiangWang/sigminer", dependencies = TRUE)

I cannot guarantee that it is bug-free for now, so please report any problems you run into.

@igordot
Author

igordot commented Jun 23, 2020

Thank you for the quick fix.

I looked a little bit more into the centromere issue. I didn't realize that chrY was an exception. From Soh et al. (as recently as 2014):

We obtained the complete sequence of the mouse Y centromere (Figure S1). Consisting of 90 kb of satellite repeats, the centromere is the only heterochromatic sequence (defined as satellite sequence) that we identified in the entire mouse MSY. ... It is located between 3.5 Mb of short-arm and 86.0 Mb of long-arm sequence, confirming that the mouse Y is the only acrocentric chromosome among all the other telocentric mouse chromosomes (Ford, 1966, McLaren et al., 1988, Roberts et al., 1988).

This essentially agrees with the pre-print you posted where they place it at around 3.4 Mb. If the centromere is marked at 110000-3000000, it's not substantially misplaced.

While the UCSC gap file is missing the centromere for chrY, it lists the short arm as 100000-110000, same as the other chromosomes. By that logic, the centromere starts at 110000. Those reference files have been available for years, so maybe working with those assumptions is reasonable.

@ShixiangWang
Owner

@igordot Thanks, so I will keep the centromere location at 110000-3000000 for all chromosomes. If you find a more accurate annotation, please remind me to refine it.
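
For readers following along, here is a minimal base R sketch of what such a centromere table could look like when the 110000-3000000 placeholder is applied to every mm10 chromosome, chrY included. The object name `centromeres_mm10` and the column names `chrom`, `left.base`, and `right.base` are assumptions made for illustration. Check `?centromeres.hg38` for the format sigminer actually uses.

```r
# Illustrative only: the object and column names are assumptions,
# not sigminer's confirmed schema.
chroms <- paste0("chr", c(1:19, "X", "Y"))

centromeres_mm10 <- data.frame(
  chrom = chroms,
  left.base = 110000,    # placeholder start discussed in this thread
  right.base = 3000000,  # placeholder end discussed in this thread
  stringsAsFactors = FALSE
)

# One row per chromosome: 19 autosomes plus chrX and chrY.
nrow(centromeres_mm10)
head(centromeres_mm10, 3)
```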

@ShixiangWang
Owner

I checked the gap data further and found that, for all chromosomes:

  • the telomere is located at 0-100000
  • the short arm is located at 100000-110000

So the short arm is only 10 kb long in the UCSC annotation, shorter than the telomere?

Note that my code splits each chromosome into p and q arms based on the centromere location. For mouse, I will therefore treat every location < 110000 as the p arm (not only the region between 100000 and 110000), and handle the q arm the same way.

image

image
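
The arm split described above can be sketched roughly as follows. This is not sigminer's implementation: `assign_arm()` is a hypothetical helper, and labelling segments that touch the centromere interval as "spans centromere" is an assumption made for illustration.

```r
# Hypothetical helper, not part of sigminer: label a segment by arm
# using the placeholder mm10 centromere interval (110000-3000000).
assign_arm <- function(start, end, cen_left = 110000, cen_right = 3000000) {
  ifelse(end <= cen_left, "p",
    ifelse(start >= cen_right, "q", "spans centromere")
  )
}

# Made-up example segments on one chromosome.
segs <- data.frame(
  start = c(1, 5000000, 100000),
  end   = c(100000, 9000000, 4000000)
)
segs$arm <- assign_arm(segs$start, segs$end)
segs
```

Under these assumptions the first segment falls on the p arm, the second on the q arm, and the third overlaps the centromere interval.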

@igordot
Author

igordot commented Jun 23, 2020

I think that's fine. If there is a labeled "short arm" region, you might as well call it p.

From MGI:

As mouse chromosomes are all acrocentric, with the exception of Chr Y, the p and q arm designations standard for human chromosomes are not used. ... Because mouse autosomes and the X Chromosome are acrocentric, they do not have a short arm other than a telomere proximal to the centromere. Therefore, most rearrangements in mouse chromosomes involve breaks in the long arm (q arm). In mouse, Chr Y has both a p and q arm.

For the purpose of sigminer, do these annotations have any impact on the analysis or are they only for visualization?

@ShixiangWang
Owner

They affect both analysis and visualization, in two places.

  • In the read_copynumber() step, each segment is annotated as p arm or q arm based on the centromere location. See https://shixiangwang.github.io/sigminer-doc/cnobject.html#distribution. You can ignore this entirely if you don't care about that data or plot.

  • In the sig_tally() step, the BPArm feature is based on the centromere location: it counts the number of breakpoints in each arm (p and q), so if the annotation is inaccurate, the resulting values may also be slightly inaccurate.
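
To illustrate how a per-arm breakpoint count depends on the centromere annotation, here is a small base R sketch. It is not the BPArm code from sigminer. `count_breakpoints_by_arm()` is a hypothetical helper, the breakpoint positions are made up, and the alternative interval of 3.5-3.59 Mb is only an assumed stand-in for a chrY centromere placed further along the chromosome.

```r
# Hypothetical helper, not sigminer's BPArm implementation: count
# breakpoints on each side of a given centromere interval.
count_breakpoints_by_arm <- function(breakpoints, cen_left, cen_right) {
  arm <- ifelse(breakpoints < cen_left, "p",
    ifelse(breakpoints > cen_right, "q", NA)
  )
  table(factor(arm, levels = c("p", "q")))
}

# Made-up breakpoint positions on a single chromosome.
bp <- c(50000, 3200000, 12000000, 45000000)

# Placeholder interval used in this thread.
counts_ucsc <- count_breakpoints_by_arm(bp, 110000, 3000000)

# Assumed alternative interval, for comparison only.
counts_alt <- count_breakpoints_by_arm(bp, 3500000, 3590000)

counts_ucsc
counts_alt
```

Only the breakpoint at 3.2 Mb changes arm between the two annotations, which is the kind of small shift described above.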

In my view, the impact is acceptable, but I want to keep it accurate if I can. I am still confused: should I reset the Y centromere to 3.5 Mb to 86.0 Mb based on

It is located between 3.5 Mb of short-arm and 86.0 Mb of long-arm sequence

@igordot
Author

igordot commented Jun 24, 2020

In my view, the impact is acceptable, but I want to keep it accurate if I can. I am still confused: should I reset the Y centromere to 3.5 Mb to 86.0 Mb based on

I think those coordinates are arm-level (not chromosome-level). So 3.5Mb is from the start of the short arm, but 86Mb is from the start of the long arm (end of the chromosome, I think). The full chromosome is 89.6Mb based on that study, so that would make sense (3.5Mb + 90kb + 86Mb).

The mm10 chrY is 91.7Mb, so the coordinates are not directly comparable.
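
The arithmetic above can be checked in a few lines of R. All numbers come from this thread (3.5 Mb short arm, 90 kb centromere, 86.0 Mb long arm, 91.7 Mb for mm10 chrY). The variable names are just for illustration.

```r
# Figures quoted in this thread, in megabases.
short_arm_mb  <- 3.5
centromere_mb <- 0.09   # 90 kb of satellite repeats
long_arm_mb   <- 86.0

# Chromosome length implied by the study's arm-level figures.
study_total_mb <- short_arm_mb + centromere_mb + long_arm_mb

# Length of chrY in mm10 as quoted above.
mm10_chrY_mb <- 91.7

# Difference between the two assemblies' chrY lengths.
length_gap_mb <- mm10_chrY_mb - study_total_mb

round(study_total_mb, 2)
round(length_gap_mb, 2)
```

The sum comes to about 89.6 Mb, roughly 2.1 Mb shorter than the mm10 chrY, so coordinates from the study cannot be copied into mm10 directly.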

Regardless, I would just stick with the UCSC definitions for now. They are widely used. The discrepancy is about 3Mb, so a very small fraction of the total genome.

@ShixiangWang
Owner

Thanks for your valuable comments.
