additional genome builds #241
Comments
|
I don't study the mouse genome, but I will check it. You can check the data source with:
Currently, the |
|
Sure. I can provide some data. Did you mean the annotation data or some sample copy number data? |
|
The annotation data |
|
The chromsizes and cytobands are available here:
The problem is that centromeres do not exist in the mouse. UCSC lists them as 110000-3000000 for every chromosome. You can see that in: |
|
@igordot Thanks, I will clean them and add them to sigminer. One question: I also need to add transcript locations for the mouse genome, but at https://www.gencodegenes.org/mouse/ I only see Release M25. Is it based on mm10? |
|
I have been using the GENCODE gene releases with mm10. GRCm38 is the equivalent of mm10. Thank you so much for such a quick response. |
|
@igordot Thanks. It will take about a day to finish this new feature; I will let you know when it is done. |
|
@igordot I cannot find the centromere for the Y chromosome in the UCSC data. Is it also okay to set it to 110000-3000000? Based on the plot from https://www.biorxiv.org/content/10.1101/096297v3.full, the centromere of the Y chromosome seems to sit at a larger coordinate than the centromere of the X chromosome. |
|
@igordot Please install the package from GitHub and see how it works for
I cannot guarantee that there are no bugs for now, so please report any problems you run into. |
|
Thank you for the quick fix. I looked a little bit more into the centromere issue. I didn't realize that chrY was an exception. From Soh et al. (as recently as 2014):
This essentially agrees with the pre-print you posted where they place it at around 3.4 Mb. If the centromere is marked at 110000-3000000, it's not substantially misplaced. While the UCSC gap file is missing the centromere for chrY, it lists the short arm as 100000-110000, same as the other chromosomes. By that logic, the centromere starts at 110000. Those reference files have been available for years, so maybe working with those assumptions is reasonable. |
|
@igordot Thanks, so I will keep the centromere locations at 110000-3000000 for all chromosomes. If you find a more accurate annotation, please remind me to refine it. |
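For illustration, the convention agreed on above (one fixed centromere interval, 110000-3000000, for every mouse chromosome) can be written as a small table. This is only a sketch: the function name `build_mouse_centromeres` and the field names `chrom`, `start`, and `end` are hypothetical and are not sigminer's API, and the chromosome list assumes the standard mm10 set chr1 to chr19 plus chrX and chrY.

```python
# Sketch only: names below are hypothetical, not part of sigminer.
# Assumed chromosome set for mm10: chr1..chr19, chrX, chrY.
MOUSE_CHROMS = [f"chr{i}" for i in range(1, 20)] + ["chrX", "chrY"]

# Interval taken from the discussion above (UCSC lists it for every chromosome).
CENTROMERE_START = 110_000
CENTROMERE_END = 3_000_000


def build_mouse_centromeres(chroms=MOUSE_CHROMS):
    """Return one centromere record per chromosome, all with the same interval."""
    return [
        {"chrom": chrom, "start": CENTROMERE_START, "end": CENTROMERE_END}
        for chrom in chroms
    ]


if __name__ == "__main__":
    for record in build_mouse_centromeres():
        print(record["chrom"], record["start"], record["end"])
```

Because chrY receives the same interval as the other chromosomes, a later refinement would only need to change that single record.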
|
I checked the gap data further and found that for all chromosomes
So the short arm is just 10 kb long in the UCSC annotation, smaller than the telomere? Note that my code splits each chromosome into p and q arms based on the centromere location, so for mouse I will treat any location < 110000 as the p arm (not only the region between 100000 and 110000), and handle the q arm the same way. |
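The arm split described above can be sketched as a position lookup. This is a minimal illustration, not sigminer's code: the function name `assign_arm` is hypothetical, and two details are assumptions that the thread does not settle, namely that the q arm begins strictly after 3000000 and that positions inside the interval are labeled as centromere.

```python
# Sketch only: assign_arm is a hypothetical helper, not sigminer's implementation.
# Assumptions: q arm starts after centromere_end; positions inside the
# centromere interval get their own label.
def assign_arm(position, centromere_start=110_000, centromere_end=3_000_000):
    """Classify a genomic position as 'p', 'centromere', or 'q'."""
    if position < centromere_start:
        return "p"  # everything before the centromere, including the first 100 kb
    if position > centromere_end:
        return "q"
    return "centromere"


if __name__ == "__main__":
    for pos in (50_000, 105_000, 2_000_000, 3_000_001):
        print(pos, assign_arm(pos))
```

With these defaults, almost every mouse segment lands on the q arm, which matches the point that the annotated short arm is tiny.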
|
I think that's fine. If there is a labeled "short arm" region, you might as well call it p. From MGI:
For the purpose of sigminer, do these annotations have any impact on the analysis or are they only for visualization? |
|
They affect both analysis and visualization, in two places.
In my view, the impact is acceptable, but I want to keep it accurate if I can. I am still confused: should I reset the Y chromosome to 3.5 Mb to 86.0 Mb based on
|
I think those coordinates are arm-level (not chromosome-level). So 3.5Mb is from the start of the short arm, but 86Mb is from the start of the long arm (end of the chromosome, I think). The full chromosome is 89.6Mb based on that study, so that would make sense (3.5Mb + 90kb + 86Mb). The mm10 chrY is 91.7Mb, so the coordinates are not directly comparable. Regardless, I would just stick with the UCSC definitions for now. They are widely used. The discrepancy is about 3Mb, so a very small fraction of the total genome. |
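The arithmetic in this comment can be checked with a few lines. The numbers are the rounded values quoted above (3.5 Mb, 90 kb, 86 Mb, and 91.7 Mb for mm10 chrY), so the result is only as precise as that rounding; the variable names are purely illustrative.

```python
# Worked check of the arm-level arithmetic, using the rounded values quoted above.
MB = 1_000_000

short_arm = 3_500_000    # 3.5 Mb, measured from the start of the short arm
centromere = 90_000      # 90 kb
long_arm = 86_000_000    # 86 Mb, measured along the long arm

study_total = short_arm + centromere + long_arm
mm10_chry = 91_700_000   # 91.7 Mb, as quoted for mm10 chrY

difference = mm10_chry - study_total

if __name__ == "__main__":
    print(f"study total: {study_total / MB:.2f} Mb")
    print(f"mm10 chrY:   {mm10_chry / MB:.2f} Mb")
    print(f"difference:  {difference / MB:.2f} Mb")
```

The arm-level pieces sum to 89.59 Mb, consistent with the 89.6 Mb full-chromosome figure, and the mm10 assembly is about 2.11 Mb longer, which is why the two coordinate systems are not directly comparable.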
|
Thanks for your valuable comments |



Is it possible to use genome builds other than hg19 or hg38? Specifically, I am hoping to use mouse mm10.
If you modify get_genome_annotation(), would that be sufficient? That function retrieves centromeres, chromsize, and cytobands, but I couldn't find how exactly those were generated.