Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transcript list does not match prescored transcripts for hg38 #34

Closed
aerval opened this issue Feb 3, 2020 · 5 comments
Closed

transcript list does not match prescored transcripts for hg38 #34

aerval opened this issue Feb 3, 2020 · 5 comments

Comments

@aerval
Copy link

aerval commented Feb 3, 2020

I noticed that for GRCh38 the prescored file does contain more transcripts than annotated via the script. Therefore variants are annotated differently.

i.e. tabix spliceai_scores.raw.snv.hg38.vcf.gz 17:7013943-7013943 results for me in this output:

17      7013943 .       A       C       .       .       SpliceAI=C|AC040977.1|0.00|0.00|0.00|0.00|3|2|19|2
17      7013943 .       A       C       .       .       SpliceAI=C|RNASEK-C17orf49|0.00|0.00|0.00|0.00|-5|36|5|-6
17      7013943 .       A       C       .       .       SpliceAI=C|RNASEK|0.00|0.00|0.00|0.00|10|36|10|9
17      7013943 .       A       G       .       .       SpliceAI=G|AC040977.1|0.00|0.00|0.00|0.00|42|2|-10|2
17      7013943 .       A       G       .       .       SpliceAI=G|RNASEK-C17orf49|0.00|0.00|0.03|0.00|9|36|9|-6
17      7013943 .       A       G       .       .       SpliceAI=G|RNASEK|0.00|0.00|0.00|0.00|1|36|9|-6
17      7013943 .       A       T       .       .       SpliceAI=T|AC040977.1|0.00|0.00|0.00|0.00|42|2|-10|2
17      7013943 .       A       T       .       .       SpliceAI=T|RNASEK-C17orf49|0.00|0.00|0.11|0.00|10|36|-6|9
17      7013943 .       A       T       .       .       SpliceAI=T|RNASEK|0.00|0.00|0.01|0.00|10|36|-6|9

while annotations/grch38.txt only contains the transcript coordinates for RNASEK-C17orf49 and not for RNASEK and AC040977.1. However, the latter two are found in annotations/grch37.txt. I assume this is because you calculated scores only for GRCh37 did liftover from there. Is there a reason why RNASEK and AC040977.1 (and apparently many others) are not in the GRCh38 annotation? Those are active genes in Ensembl so I assume they are relatively recently added genes that somehow you only added to the GRCh37 list and not GRCh38?

@kishorejaganathan
Copy link
Contributor

Your assumption is correct. The precomputed scores were produced only for GRCh37, and they were then lifted over to GRCh38. However, the files annotations/grch37.txt and annotations/grch38.txt are not lifted over versions of each other. We downloaded both of those from the UCSC genome browser, and filtered out some genes from annotations/grch38.txt as they had different number of exons/transcript lengths in the two builds. So, when you use the grch38 option, if you see a score, it will match the score on the precomputed lifted over files, but the lifted over files might contain a few more transcripts.

@aerval
Copy link
Author

aerval commented Mar 4, 2020

Just so you know, this is not totally correct. We found that you included genes on the Y chromosomal pseudoautosomal region in the annotations file that are not in the prescored file (only prescored for the X coordinates). Furthermore we found that the annotations for GRCh37 includes the gene SCGB1C2 on chromosome 11 while on GRCh38 it is on chromosome 17. While the latter is in line with Gencode, the GRCh37 position is totally off and can only be partially explained by the gene SCGB1C1 being at a mostly overlapping position. However, due to the liftover, in the GRCh38 prescored file SCGB1C2 variants are now all located on chromosome 11 too.

I would recommend to remove SCGB1C2 and the pseudoautosomal region genes from the SpliceAI GRCh38 annotations. However, I am not 100% sure SCGB1C2 is the only case of such a coordinate swap since I have no idea how it may have originated.

@kishorejaganathan
Copy link
Contributor

Thanks for letting me know, I got to the bottom of this issue. The GENCODE annotations are originally in hg38 and the hg19 version is obtained via hg38ToHg19.over.chain, and the SpliceAI scores are originally in hg19 and the hg38 version is obtained via hg19ToHg38.over.chain. For this gene, the two liftovers are not reversible unfortunately. chr17:137525 in hg38 goes to chr11:193034 in hg19, but chr11:193034 in hg19 seems to stay at chr11:193034 in hg38 as well.

This issue seems to affect 17 genes in total:
PPIAL4E
MUC12
OR4C46
OR4A8
OR4A5
IGHV1-69-2
RP11-294C11.4
RP11-294C11.2
POTEB
SCGB1C2
CBSL
U2AF1L5
CH507-152C13.3
SMIM11B
OR11H1
SPANXB1
OPN1MW3

I'll take these out from the annotations file for the sake of consistency.

@aerval
Copy link
Author

aerval commented Mar 6, 2020

Awesome, that totally makes sense!

@smeeta
Copy link

smeeta commented Nov 18, 2021

Hi !

Any idea why hg38 ANNOVAR annotated variants (gene) do not match the TxDb.Hsapiens.UCSC.hg38.knownGene ?

I used hg 38 to annotate my.vcf file in ANNOVAR however when I used this my_filtered.vcf to plot variants using lollyplot from trackViewer package. I get different gene name. The latter uses TxDb.Hsapiens.UCSC.hg38.knownGene db.
Example I have 5 variants in gene PCDHA7 but the lollyplot is showing those 5 variants to be on gene PCDHB4.

Annovar link - https://annovar.openbioinformatics.org/en/latest/
Lollyplot link - https://www.bioconductor.org/packages/devel/bioc/vignettes/trackViewer/inst/doc/lollipopPlot.html#Variant_Call_Format_(VCF)_data ( see under vcf input format)

Any help will be super great !!
Thank you
Smeeta

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants