transcript list does not match prescored transcripts for hg38 #34
Your assumption is correct. The precomputed scores were produced only for GRCh37 and were then lifted over to GRCh38. However, the files annotations/grch37.txt and annotations/grch38.txt are not lifted-over versions of each other. We downloaded both from the UCSC Genome Browser and filtered some genes out of annotations/grch38.txt because they had a different number of exons or a different transcript length in the two builds. So, when you use the grch38 option, any score you see will match the score in the precomputed lifted-over files, but the lifted-over files might contain a few more transcripts.
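For readers who want to reproduce this kind of filter, here is a rough sketch. It is only an illustration: it assumes a tab-separated layout with the columns name, chromosome, strand, transcript start, transcript end, exon starts, and exon ends. The function names and the toy rows (GENE_A, GENE_B) are hypothetical and are not taken from the SpliceAI repository.

```python
import csv
import io


def load_annotations(text):
    """Parse annotation rows into {gene: (transcript_length, exon_count)}.

    Assumed (not confirmed) tab-separated columns: name, chrom, strand,
    tx_start, tx_end, exon_starts, exon_ends. Header lines starting
    with '#' and empty lines are skipped.
    """
    genes = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        name, _chrom, _strand, tx_start, tx_end, exon_starts, _exon_ends = row[:7]
        exon_count = len([s for s in exon_starts.split(",") if s])
        genes[name] = (int(tx_end) - int(tx_start), exon_count)
    return genes


def consistent_genes(build_a, build_b):
    """Genes present in both builds with equal transcript length and exon count."""
    shared = build_a.keys() & build_b.keys()
    return sorted(g for g in shared if build_a[g] == build_b[g])


HEADER = "#NAME\tCHROM\tSTRAND\tTX_START\tTX_END\tEXON_START\tEXON_END"

# Toy data only: hypothetical genes with made-up coordinates.
GRCH37_TOY = "\n".join([
    HEADER,
    "GENE_A\t1\t+\t100\t500\t100,300,\t200,500,",
    "GENE_B\t2\t-\t1000\t2000\t1000,1500,\t1200,2000,",
])
GRCH38_TOY = "\n".join([
    HEADER,
    "GENE_A\t1\t+\t150\t550\t150,350,\t250,550,",
    "GENE_B\t2\t-\t1100\t2100\t1100,1400,1800,\t1200,1500,2100,",
])

kept = consistent_genes(load_annotations(GRCH37_TOY), load_annotations(GRCH38_TOY))
print(kept)
```

In this toy input GENE_B gains an exon in the second build, so only GENE_A survives the filter.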
Just so you know, this is not totally correct. We found that the annotations file includes genes in the Y-chromosomal pseudoautosomal region that are not in the prescored file (they are only prescored for the X coordinates). Furthermore, we found that the annotations for GRCh37 place the gene SCGB1C2 on chromosome 11, while on GRCh38 it is on chromosome 17. While the latter is in line with GENCODE, the GRCh37 position is totally off and can only be partially explained by the gene SCGB1C1 being at a mostly overlapping position. However, due to the liftover, SCGB1C2 variants in the GRCh38 prescored file are now all located on chromosome 11 too. I would recommend removing SCGB1C2 and the pseudoautosomal region genes from the SpliceAI GRCh38 annotations. That said, I am not 100% sure SCGB1C2 is the only case of such a coordinate swap, since I have no idea how it may have originated.
Thanks for letting me know, I got to the bottom of this issue. The GENCODE annotations are originally in hg38, and the hg19 version is obtained via hg38ToHg19.over.chain. The SpliceAI scores are originally in hg19, and the hg38 version is obtained via hg19ToHg38.over.chain. For this gene, the two liftovers are unfortunately not reversible: chr17:137525 in hg38 goes to chr11:193034 in hg19, but chr11:193034 in hg19 seems to stay at chr11:193034 in hg38 as well. This issue seems to affect 17 genes in total; I'll take these out of the annotations file for the sake of consistency.
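The round-trip check behind that finding can be sketched as follows. This is a minimal illustration, not the script actually used: the two dictionaries are toy stand-ins for the chain files, and the helper names are hypothetical. Only the chr17/chr11 coordinates come from this thread; the chr1 control entry is made up.

```python
# Toy stand-ins for hg38ToHg19.over.chain and hg19ToHg38.over.chain.
# Only the chr17/chr11 coordinates are quoted from the thread; the chr1
# entry is an invented control that round-trips cleanly.
HG38_TO_HG19 = {
    ("chr17", 137525): ("chr11", 193034),
    ("chr1", 1000): ("chr1", 900),
}
HG19_TO_HG38 = {
    ("chr11", 193034): ("chr11", 193034),
    ("chr1", 900): ("chr1", 1000),
}


def lift(position, chain):
    """Look up a (chrom, pos) pair in a toy chain mapping; None if unmapped."""
    return chain.get(position)


def is_reversible(position, forward, backward):
    """True if lifting forward and then back returns the starting position."""
    lifted = lift(position, forward)
    if lifted is None:
        return False
    return lift(lifted, backward) == position


def irreversible_positions(positions, forward, backward):
    """Positions whose forward-then-backward liftover does not round-trip."""
    return [p for p in positions if not is_reversible(p, forward, backward)]


bad = irreversible_positions(list(HG38_TO_HG19), HG38_TO_HG19, HG19_TO_HG38)
print(bad)
```

With real chain files, `lift` would have to be replaced by a call into an actual liftover tool; the comparison logic would stay the same.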
Awesome, that totally makes sense!
Hi! Any idea why genes annotated by ANNOVAR for hg38 do not match TxDb.Hsapiens.UCSC.hg38.knownGene? I used hg38 to annotate my.vcf in ANNOVAR, but when I plot the variants from the resulting my_filtered.vcf with lolliplot from the trackViewer package, which uses the TxDb.Hsapiens.UCSC.hg38.knownGene database, I get different gene names. ANNOVAR link: https://annovar.openbioinformatics.org/en/latest/ Any help would be great!
I noticed that for GRCh38 the prescored file contains more transcripts than are annotated via the script, so variants are annotated differently.
For example,
tabix spliceai_scores.raw.snv.hg38.vcf.gz 17:7013943-7013943
returns output for me that covers more transcripts, while annotations/grch38.txt only contains the transcript coordinates for RNASEK-C17orf49 and not for RNASEK and AC040977.1. However, the latter two are found in annotations/grch37.txt. I assume this is because you calculated scores only for GRCh37 and did a liftover from there. Is there a reason why RNASEK and AC040977.1 (and apparently many others) are not in the GRCh38 annotation? Those are active genes in Ensembl, so I assume they are relatively recently added genes that you somehow only added to the GRCh37 list and not to GRCh38?
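One way to list such genes is to compare the name columns of the two annotation files. The sketch below is an illustration under assumptions: it takes the gene name to be the first tab-separated column, and it uses in-memory toy text rather than the real files. The helper names are hypothetical, and the toy rows carry only the gene names mentioned above plus the chromosome, with all other columns omitted.

```python
def gene_names(annotation_text):
    """Collect the first tab-separated field of every non-header line."""
    names = set()
    for line in annotation_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t")[0])
    return names


def missing_in(reference_text, other_text):
    """Gene names present in reference_text but absent from other_text."""
    return sorted(gene_names(reference_text) - gene_names(other_text))


# Toy stand-ins for annotations/grch37.txt and annotations/grch38.txt.
# Real files would be read with open(...).read(); coordinate columns
# are deliberately left out here.
GRCH37_TOY = "#NAME\tCHROM\nRNASEK\t17\nAC040977.1\t17\n"
GRCH38_TOY = "#NAME\tCHROM\nRNASEK-C17orf49\t17\n"

only_in_grch37 = missing_in(GRCH37_TOY, GRCH38_TOY)
print(only_in_grch37)
```

Running the same comparison in the other direction shows names that exist only in the second file.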