transcript list does not match prescored transcripts for hg38 #34

aerval · 2020-02-03T16:02:27Z

I noticed that for GRCh38 the prescored file does contain more transcripts than annotated via the script. Therefore variants are annotated differently.

i.e. tabix spliceai_scores.raw.snv.hg38.vcf.gz 17:7013943-7013943 results for me in this output:

17      7013943 .       A       C       .       .       SpliceAI=C|AC040977.1|0.00|0.00|0.00|0.00|3|2|19|2
17      7013943 .       A       C       .       .       SpliceAI=C|RNASEK-C17orf49|0.00|0.00|0.00|0.00|-5|36|5|-6
17      7013943 .       A       C       .       .       SpliceAI=C|RNASEK|0.00|0.00|0.00|0.00|10|36|10|9
17      7013943 .       A       G       .       .       SpliceAI=G|AC040977.1|0.00|0.00|0.00|0.00|42|2|-10|2
17      7013943 .       A       G       .       .       SpliceAI=G|RNASEK-C17orf49|0.00|0.00|0.03|0.00|9|36|9|-6
17      7013943 .       A       G       .       .       SpliceAI=G|RNASEK|0.00|0.00|0.00|0.00|1|36|9|-6
17      7013943 .       A       T       .       .       SpliceAI=T|AC040977.1|0.00|0.00|0.00|0.00|42|2|-10|2
17      7013943 .       A       T       .       .       SpliceAI=T|RNASEK-C17orf49|0.00|0.00|0.11|0.00|10|36|-6|9
17      7013943 .       A       T       .       .       SpliceAI=T|RNASEK|0.00|0.00|0.01|0.00|10|36|-6|9

while annotations/grch38.txt only contains the transcript coordinates for RNASEK-C17orf49 and not for RNASEK and AC040977.1. However, the latter two are found in annotations/grch37.txt. I assume this is because you calculated scores only for GRCh37 did liftover from there. Is there a reason why RNASEK and AC040977.1 (and apparently many others) are not in the GRCh38 annotation? Those are active genes in Ensembl so I assume they are relatively recently added genes that somehow you only added to the GRCh37 list and not GRCh38?

The text was updated successfully, but these errors were encountered:

kishorejaganathan · 2020-02-15T00:42:53Z

Your assumption is correct. The precomputed scores were produced only for GRCh37, and they were then lifted over to GRCh38. However, the files annotations/grch37.txt and annotations/grch38.txt are not lifted over versions of each other. We downloaded both of those from the UCSC genome browser, and filtered out some genes from annotations/grch38.txt as they had different number of exons/transcript lengths in the two builds. So, when you use the grch38 option, if you see a score, it will match the score on the precomputed lifted over files, but the lifted over files might contain a few more transcripts.

aerval · 2020-03-04T15:18:11Z

Just so you know, this is not totally correct. We found that you included genes on the Y chromosomal pseudoautosomal region in the annotations file that are not in the prescored file (only prescored for the X coordinates). Furthermore we found that the annotations for GRCh37 includes the gene SCGB1C2 on chromosome 11 while on GRCh38 it is on chromosome 17. While the latter is in line with Gencode, the GRCh37 position is totally off and can only be partially explained by the gene SCGB1C1 being at a mostly overlapping position. However, due to the liftover, in the GRCh38 prescored file SCGB1C2 variants are now all located on chromosome 11 too.

I would recommend to remove SCGB1C2 and the pseudoautosomal region genes from the SpliceAI GRCh38 annotations. However, I am not 100% sure SCGB1C2 is the only case of such a coordinate swap since I have no idea how it may have originated.

kishorejaganathan · 2020-03-06T12:12:25Z

Thanks for letting me know, I got to the bottom of this issue. The GENCODE annotations are originally in hg38 and the hg19 version is obtained via hg38ToHg19.over.chain, and the SpliceAI scores are originally in hg19 and the hg38 version is obtained via hg19ToHg38.over.chain. For this gene, the two liftovers are not reversible unfortunately. chr17:137525 in hg38 goes to chr11:193034 in hg19, but chr11:193034 in hg19 seems to stay at chr11:193034 in hg38 as well.

This issue seems to affect 17 genes in total:
PPIAL4E
MUC12
OR4C46
OR4A8
OR4A5
IGHV1-69-2
RP11-294C11.4
RP11-294C11.2
POTEB
SCGB1C2
CBSL
U2AF1L5
CH507-152C13.3
SMIM11B
OR11H1
SPANXB1
OPN1MW3

I'll take these out from the annotations file for the sake of consistency.

aerval · 2020-03-06T12:20:21Z

Awesome, that totally makes sense!

smeeta · 2021-11-18T05:53:22Z

Hi !

Any idea why hg38 ANNOVAR annotated variants (gene) do not match the TxDb.Hsapiens.UCSC.hg38.knownGene ?

I used hg 38 to annotate my.vcf file in ANNOVAR however when I used this my_filtered.vcf to plot variants using lollyplot from trackViewer package. I get different gene name. The latter uses TxDb.Hsapiens.UCSC.hg38.knownGene db.
Example I have 5 variants in gene PCDHA7 but the lollyplot is showing those 5 variants to be on gene PCDHB4.

Annovar link - https://annovar.openbioinformatics.org/en/latest/
Lollyplot link - https://www.bioconductor.org/packages/devel/bioc/vignettes/trackViewer/inst/doc/lollipopPlot.html#Variant_Call_Format_(VCF)_data ( see under vcf input format)

Any help will be super great !!
Thank you
Smeeta

kishorejaganathan closed this as completed Feb 15, 2020

kishorejaganathan mentioned this issue May 22, 2020

Gene AC068620.1 location #45

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

transcript list does not match prescored transcripts for hg38 #34

transcript list does not match prescored transcripts for hg38 #34

aerval commented Feb 3, 2020

kishorejaganathan commented Feb 15, 2020

aerval commented Mar 4, 2020 •

edited

Loading

kishorejaganathan commented Mar 6, 2020

aerval commented Mar 6, 2020

smeeta commented Nov 18, 2021

transcript list does not match prescored transcripts for hg38 #34

transcript list does not match prescored transcripts for hg38 #34

Comments

aerval commented Feb 3, 2020

kishorejaganathan commented Feb 15, 2020

aerval commented Mar 4, 2020 • edited Loading

kishorejaganathan commented Mar 6, 2020

aerval commented Mar 6, 2020

smeeta commented Nov 18, 2021

aerval commented Mar 4, 2020 •

edited

Loading