additional genome builds #241

igordot · 2020-06-22T13:28:51Z

Is it possible to use genome builds other than hg19 or hg38? Specifically, I am hoping to use mouse mm10.

If you modify get_genome_annotation(), would that be sufficient? That function retrieves centromeres, chromsize, and cytobands, but I couldn't find how exactly those were generated.

ShixiangWang · 2020-06-23T01:53:15Z

I don't study the mouse genome, but I will check. get_genome_annotation() is a key part, but all of the code needs to be checked before it can be used with a mouse genome.

You can check the data source with:

> ?chromsize.hg38
> ?centromeres.hg38
> ?cytobands.hg38

Currently, cytobands are not required, but chromsize and centromeres are. Could you provide the mouse data in the same format so I can support this as soon as possible?

igordot · 2020-06-23T01:56:26Z

Sure. I can provide some data. Did you mean the annotation data or some sample copy number data?

ShixiangWang · 2020-06-23T01:58:37Z

The annotation data: chromsize.mm10, centromeres.mm10, etc. You can find them at UCSC or another reference database.

igordot · 2020-06-23T02:21:28Z

The chromsizes and cytobands are available here:

The problem is that the mouse assembly has no real centromere coordinates. UCSC lists them as 110000-3000000 for every chromosome. You can see that in:

https://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/gap.txt.gz
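To see what the gap table holds, one could filter it by gap type. The following is a minimal sketch in base R, assuming the standard UCSC gap table column order (bin, chrom, chromStart, chromEnd, ix, n, size, type, bridge). The three inline rows are an illustrative excerpt built from the coordinates quoted in this thread, not a download of the real file, and the bin and ix values are placeholders.

```r
# Column names assumed from the UCSC gap table schema.
gap_cols <- c("bin", "chrom", "chromStart", "chromEnd",
              "ix", "n", "size", "type", "bridge")

# Illustrative excerpt only; bin and ix values are placeholders.
gap_txt <- paste(
  "0\tchr1\t0\t100000\t1\tN\t100000\ttelomere\tno",
  "0\tchr1\t100000\t110000\t2\tN\t10000\tshort_arm\tno",
  "0\tchr1\t110000\t3000000\t3\tN\t2890000\tcentromere\tno",
  sep = "\n"
)

# Parse the tab-separated text into a data frame.
gap <- read.delim(text = gap_txt, header = FALSE, col.names = gap_cols,
                  stringsAsFactors = FALSE)

# Keep only the rows annotated as centromeres.
centromeres <- gap[gap$type == "centromere",
                   c("chrom", "chromStart", "chromEnd")]
print(centromeres)
```

The same filter should apply to the full gap.txt.gz once it has been downloaded and read with the same column names.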

ShixiangWang · 2020-06-23T02:25:40Z

@igordot Thanks, I will clean them and add to sigminer.

One question: I also need to add transcript locations for the mouse genome, but at https://www.gencodegenes.org/mouse/ I only see Release M25. Is it based on mm10?

igordot · 2020-06-23T02:28:44Z

I have been using the GENCODE gene releases with mm10. GRCm38 is the equivalent of mm10.

Thank you so much for such a quick response.

ShixiangWang · 2020-06-23T02:31:22Z

@igordot Thanks. It will take about a day to finish this new feature. I will let you know when it is done.

ShixiangWang · 2020-06-23T06:40:30Z

@igordot I cannot find a centromere for the Y chromosome in the UCSC data. Is it okay to set it to 110000-3000000 as well?

Based on the plot from https://www.biorxiv.org/content/10.1101/096297v3.full, it seems the centromere of the Y chromosome lies at a larger coordinate than the centromere of the X chromosome.

ShixiangWang · 2020-06-23T06:50:02Z

@igordot Please install the package from GitHub and see how it works for the mm10 genome.

remotes::install_github("ShixiangWang/sigminer", dependencies = TRUE)

I cannot guarantee it is bug-free yet, so please let me know if you run into any problems.

igordot · 2020-06-23T14:44:27Z

Thank you for the quick fix.

I looked a little bit more into the centromere issue. I didn't realize that chrY was an exception. From Soh et al. (as recently as 2014):

We obtained the complete sequence of the mouse Y centromere (Figure S1). Consisting of 90 kb of satellite repeats, the centromere is the only heterochromatic sequence (defined as satellite sequence) that we identified in the entire mouse MSY. ... It is located between 3.5 Mb of short-arm and 86.0 Mb of long-arm sequence, confirming that the mouse Y is the only acrocentric chromosome among all the other telocentric mouse chromosomes (Ford, 1966, McLaren et al., 1988, Roberts et al., 1988).

This essentially agrees with the pre-print you posted where they place it at around 3.4 Mb. If the centromere is marked at 110000-3000000, it's not substantially misplaced.

While the UCSC gap file is missing the centromere for chrY, it lists the short arm as 100000-110000, same as the other chromosomes. By that logic, the centromere starts at 110000. Those reference files have been available for years, so maybe working with those assumptions is reasonable.

ShixiangWang · 2020-06-23T15:57:11Z

@igordot Thanks, so I will keep the centromere location at 110000-3000000 for all chromosomes. If you find a more accurate annotation, please let me know so I can refine it.

ShixiangWang · 2020-06-23T16:09:04Z

I checked the gap data further and found that, for all chromosomes:

the telomere is located at 0-100000
the short arm is located at 100000-110000

So the short arm is just 10 kb long in the UCSC annotation, shorter than the telomere?

Note that my code splits each chromosome into p and q arms based on the centromere location, so for mouse I will treat every position < 110000 as the p arm (not only 100000-110000), and the same for q.
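The split described above can be sketched as follows. This is an illustration, not sigminer's actual code: `assign_arm` is a hypothetical helper, the default centromere coordinates are the UCSC mm10 values quoted in this thread, and treating positions at or beyond the centromere end as q is an assumption about what "the same for q" means.

```r
# Hypothetical helper: label positions by arm relative to the centromere.
# Coordinates are treated as 0-based, half-open intervals (UCSC style).
assign_arm <- function(pos, cen_start = 110000, cen_end = 3000000) {
  ifelse(pos < cen_start, "p",
         ifelse(pos >= cen_end, "q", "centromere"))
}

# Telomere, labeled short arm, inside the centromere gap, long arm.
positions <- c(50000, 105000, 2000000, 5000000)
arms <- assign_arm(positions)
print(arms)
```

Under this rule both the telomere (0-100000) and the labeled short arm (100000-110000) count as p, which matches the behavior described in the comment.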

igordot · 2020-06-23T16:23:47Z

I think that's fine. If there is a labeled "short arm" region, you might as well call it p.

From MGI:

As mouse chromosomes are all acrocentric, with the exception of Chr Y, the p and q arm designations standard for human chromosomes are not used. ... Because mouse autosomes and the X Chromosome are acrocentric, they do not have a short arm other than a telomere proximal to the centromere. Therefore, most rearrangements in mouse chromosomes involve breaks in the long arm (q arm). In mouse, Chr Y has both a p and q arm.

For the purpose of sigminer, do these annotations have any impact on the analysis or are they only for visualization?

ShixiangWang · 2020-06-24T01:56:38Z

They affect both analysis and visualization, in two places.

In the read_copynumber() step, each segment is annotated as p arm or q arm based on the centromere location. See https://shixiangwang.github.io/sigminer-doc/cnobject.html#distribution. You can ignore it entirely if you don't need that data or plot.
In the sig_tally() step, the BPArm feature is based on the centromere location. It counts the number of breakpoints in each arm (p and q), so if the annotation is inaccurate, the resulting values may also be slightly inaccurate.

In my view, the impact is acceptable, but I would like to keep it accurate if I can. I am still confused: should I reset the Y centromere to 3.5 Mb to 86.0 Mb based on
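To illustrate why the centromere coordinates matter for a per-arm breakpoint count, here is a minimal sketch in base R. It is not how sig_tally() computes BPArm; the segment table is toy data, and counting each internal segment boundary as one breakpoint is an assumption made for the example.

```r
# Toy copy number segments on one chromosome (hypothetical data).
segs <- data.frame(
  chrom = "chr1",
  start = c(0, 60000, 4000000, 9000000),
  end   = c(60000, 4000000, 9000000, 20000000)
)

# Centromere coordinates as listed by UCSC for mm10 in this thread.
cen_start <- 110000
cen_end   <- 3000000

# Assumption: each boundary between adjacent segments is one breakpoint.
breakpoints <- segs$end[-nrow(segs)]

# Label each breakpoint by arm, then count per arm.
arm <- ifelse(breakpoints < cen_start, "p",
              ifelse(breakpoints >= cen_end, "q", "centromere"))
bp_per_arm <- table(factor(arm, levels = c("p", "q")))
print(bp_per_arm)
```

With the mouse coordinates the p arm covers only the first 110 kb, so in practice nearly every breakpoint would be counted on the q arm.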

It is located between 3.5 Mb of short-arm and 86.0 Mb of long-arm sequence

igordot · 2020-06-24T20:20:19Z

"In my view, the impact is acceptable. But I want to keep it accurate if I could, I am still confused, should I reset the Y to 3.5Mb to 86.0 Mb based on"

I think those coordinates are arm-level (not chromosome-level). So 3.5Mb is from the start of the short arm, but 86Mb is from the start of the long arm (end of the chromosome, I think). The full chromosome is 89.6Mb based on that study, so that would make sense (3.5Mb + 90kb + 86Mb).

The mm10 chrY is 91.7Mb, so the coordinates are not directly comparable.
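The arithmetic above can be checked directly. This sketch only uses the figures quoted in this thread (3.5 Mb short arm, 90 kb centromere, 86.0 Mb long arm, and an approximate mm10 chrY length of 91.7 Mb); the variable names are made up for the example.

```r
# Arm-level lengths from Soh et al., as quoted in this thread (in bp).
short_arm  <- 3.5e6
centromere <- 90e3
long_arm   <- 86.0e6

# Total chrY length implied by the study.
soh_total <- short_arm + centromere + long_arm

# Approximate mm10 chrY length quoted in this thread.
mm10_chrY <- 91.7e6

cat("Study total (Mb):", soh_total / 1e6, "\n")
cat("Difference from mm10 (Mb):", (mm10_chrY - soh_total) / 1e6, "\n")
```

The study total comes to 89.59 Mb, about 2.1 Mb shorter than the mm10 chrY, which is why the two sets of coordinates cannot be compared directly.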

Regardless, I would just stick with the UCSC definitions for now. They are widely used. The discrepancy is about 3Mb, so a very small fraction of the total genome.

ShixiangWang · 2020-06-25T00:58:18Z

Thanks for your valuable comments

ShixiangWang added the enhancement New feature or request label Jun 23, 2020

ShixiangWang assigned igordot and ShixiangWang Jun 23, 2020

ShixiangWang added a commit that referenced this issue Jun 23, 2020

✨ Support mm10 #241

4de04ed

ShixiangWang closed this as completed Jun 25, 2020
