additional genome builds #241
Comments
I don't study the mouse genome, but I will look into it. You can check the data source with:
Currently, the |
Sure. I can provide some data. Did you mean the annotation data or some sample copy number data? |
The annotation data |
The chromsizes and cytobands are available here:
The problem is that centromeres do not exist in the mouse. UCSC lists them as 110000-3000000 for every chromosome. You can see that in: |
@igordot Thanks, I will clean them and add them to sigminer. Here is a question: I also need to add transcript locations for the mouse genome, but at https://www.gencodegenes.org/mouse/ I only see Release M25. Is it based on mm10? |
I have been using the GENCODE gene releases with mm10. GRCm38 is the equivalent of mm10. Thank you so much for such a quick response. |
@igordot Thanks. It will take about a day to finish this new feature. I will let you know when it is done. |
@igordot I cannot find a centromere for the Y chromosome in the UCSC data. Is it also okay to set it to 110000-3000000? Based on the plot from https://www.biorxiv.org/content/10.1101/096297v3.full, the centromere of the Y chromosome seems to sit at a larger coordinate than the centromere of the X chromosome. |
@igordot Please install the package from GitHub and see how it works for
I cannot guarantee that it is bug-free yet, so please report any problems you run into. |
Thank you for the quick fix. I looked a little bit more into the centromere issue. I didn't realize that chrY was an exception. From Soh et al. (as recently as 2014):
This essentially agrees with the pre-print you posted where they place it at around 3.4 Mb. If the centromere is marked at 110000-3000000, it's not substantially misplaced. While the UCSC gap file is missing the centromere for chrY, it lists the short arm as 100000-110000, same as the other chromosomes. By that logic, the centromere starts at 110000. Those reference files have been available for years, so maybe working with those assumptions is reasonable. |
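The fallback reasoning above (use the centromere rows from the gap file, and give any chromosome that has a short-arm row but no centromere row the same 110000-3000000 interval) could be sketched roughly as follows. This is only an illustration, not sigminer's code: the column order is assumed to follow the UCSC `gap` table dump, the sample rows reuse the coordinates quoted in this thread with placeholder values in the other fields, and the function names are hypothetical.

```python
import csv
import io

# Assumed column layout of the UCSC "gap" table dump; verify against the table schema.
GAP_COLUMNS = ["bin", "chrom", "chromStart", "chromEnd", "ix", "n", "size", "type", "bridge"]

# Illustrative rows only: the coordinates are the ones quoted in this thread,
# every other field is a placeholder.
SAMPLE_GAP = (
    "0\tchr1\t100000\t110000\t2\tN\t10000\tshort_arm\tno\n"
    "0\tchr1\t110000\t3000000\t3\tN\t2890000\tcentromere\tno\n"
    "0\tchrY\t100000\t110000\t2\tN\t10000\tshort_arm\tno\n"
)


def regions_by_type(gap_text, gap_type):
    """Return {chrom: (start, end)} for gap rows whose type equals gap_type."""
    reader = csv.DictReader(io.StringIO(gap_text), fieldnames=GAP_COLUMNS, delimiter="\t")
    regions = {}
    for row in reader:
        if row["type"] == gap_type:
            regions[row["chrom"]] = (int(row["chromStart"]), int(row["chromEnd"]))
    return regions


def centromeres_with_fallback(gap_text, default=(110_000, 3_000_000)):
    """Use annotated centromeres; fall back to `default` where only a short arm is listed."""
    centromeres = regions_by_type(gap_text, "centromere")
    for chrom in regions_by_type(gap_text, "short_arm"):
        centromeres.setdefault(chrom, default)
    return centromeres


print(centromeres_with_fallback(SAMPLE_GAP))
```

With the sample rows, chrY has no centromere row and therefore picks up the default interval.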
@igordot Thanks, so I will keep the centromere location at 110000-3000000 for all chromosomes. If you find a more accurate annotation, please remind me to refine it. |
I checked the gap data further and found that for all chromosomes
So the short arm is just 10 kb long in the UCSC annotation, smaller than the telomere? Note that my code splits each chromosome into p and q arms based on the centromere location, so for mouse I will treat every location < 110000 as the p arm (not only the region between 100000 and 110000), and likewise everything past the centromere as the q arm. |
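A minimal sketch of the arm split described in the comment above, assuming half-open intervals and the 110000-3000000 centromere used in this thread. The function name and the handling of segments that touch the centromere are assumptions for illustration; the thread does not say how sigminer treats that case.

```python
def assign_arm(start, end, centro_start=110_000, centro_end=3_000_000):
    """Label a half-open segment [start, end) relative to the centromere interval."""
    if end <= centro_start:
        return "p"  # entirely before the centromere
    if start >= centro_end:
        return "q"  # entirely after the centromere
    return "overlaps_centromere"  # assumption: reported separately here


print(assign_arm(0, 50_000))
print(assign_arm(5_000_000, 6_000_000))
print(assign_arm(100_000, 4_000_000))
```

Under this convention nearly every mouse segment lands on the q arm, since the p arm covers only the first 110 kb of each chromosome.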
I think that's fine. If there is a labeled "short arm" region, you might as well call it p. From MGI:
For the purpose of sigminer, do these annotations have any impact on the analysis or are they only for visualization? |
They will have effects on analysis and visualization in two places.
In my view, the impact is acceptable, but I want to keep it as accurate as I can. I am still confused: should I reset the Y to 3.5 Mb to 86.0 Mb based on
|
I think those coordinates are arm-level (not chromosome-level). So 3.5Mb is from the start of the short arm, but 86Mb is from the start of the long arm (end of the chromosome, I think). The full chromosome is 89.6Mb based on that study, so that would make sense (3.5Mb + 90kb + 86Mb). The mm10 chrY is 91.7Mb, so the coordinates are not directly comparable. Regardless, I would just stick with the UCSC definitions for now. They are widely used. The discrepancy is about 3Mb, so a very small fraction of the total genome. |
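The arithmetic in the comment above can be checked with a few lines. The values are the rounded figures quoted in the comment, the variable names are made up for illustration, and the 90 kb piece is assumed to be the region between the two arms.

```python
# Rounded values quoted in the comment above, in base pairs.
SHORT_ARM_BP = 3_500_000
BETWEEN_ARMS_BP = 90_000        # assumed to be the region separating the arms
LONG_ARM_BP = 86_000_000
MM10_CHRY_BP = 91_700_000       # mm10 chrY length as quoted (rounded)

study_total_bp = SHORT_ARM_BP + BETWEEN_ARMS_BP + LONG_ARM_BP
length_gap_bp = MM10_CHRY_BP - study_total_bp

print(f"study total: {study_total_bp:,} bp")
print(f"mm10 chrY minus study total: {length_gap_bp:,} bp")
```

The sum comes to 89,590,000 bp, which matches the roughly 89.6 Mb chromosome length from the study, while the mm10 chrY is about 2.1 Mb longer, which is why the two coordinate systems cannot be compared directly.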
Thanks for your valuable comments |
Is it possible to use genome builds other than hg19 or hg38? Specifically, I am hoping to use mouse mm10.
If you modify `get_genome_annotation()`, would that be sufficient? That function retrieves `centromeres`, `chromsize`, and `cytobands`, but I couldn't find how exactly those were generated.