Dave Lawrence edited this page Oct 16, 2020 · 2 revisions

VEP Annotation implementation details

RefSeq vs Ensembl

VEP provides a merged database, but we need to be able to pick a single entry per variant (ie the worst/canonical transcript). Since Ensembl is much larger than RefSeq, the pick almost always lands on an Ensembl transcript. I can't see a way to run VEP twice (eg performing --flag_pick after generating per-transcript data), so we can't get RefSeq canonical choices from the merged data.

We thus need to install either the Ensembl or the RefSeq VEP annotations, not the merged one.

Pathogenicity prediction

Pathogenicity predictions are calculated per-transcript. Problem: we store the most damaging value across all transcripts rather than the value for the specific transcript.

dbNSFP has multiple values per variant that correspond to different Uniprot_entry/Uniprot_acc/Ensembl_proteinid entries eg:

'MutationAssessor_pred': 'M&M&.&.', 'MutationAssessor_score': '2.86&2.86&.&.',
'MutationTaster_pred': 'D&D&D', 'MutationTaster_score': '1&1&1',
'Polyphen2_HVAR_pred': 'D&D&.&.', 'Polyphen2_HVAR_score': '1.0&1.0&.&.'
'FATHMM_pred': '.&.&D&.', 'FATHMM_score': '.&.&-2.46&.'
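
To make the problem concrete, here is a minimal sketch (not the VariantGrid code) of how these `&`-separated fields could be split, and how collapsing them to one "most damaging" score loses the per-entry information. The helper names `split_multi` and `most_damaging_score` are made up for illustration, and which direction counts as "worse" depends on the predictor (eg lower FATHMM scores are more damaging), so it is passed in as a parameter.

```python
from typing import List, Optional


def split_multi(raw: str, sep: str = "&") -> List[Optional[str]]:
    """Split a multi-valued dbNSFP field; '.' (or empty) marks a missing entry."""
    return [None if v in (".", "") else v for v in raw.split(sep)]


def most_damaging_score(raw: str, higher_is_worse: bool = True) -> Optional[float]:
    """Collapse all entries to a single score (the lossy behaviour described above)."""
    scores = [float(v) for v in split_multi(raw) if v is not None]
    if not scores:
        return None
    return max(scores) if higher_is_worse else min(scores)


# Values copied from the example record above
record = {
    "MutationAssessor_score": "2.86&2.86&.&.",
    "FATHMM_score": ".&.&-2.46&.",
}

print(split_multi(record["MutationAssessor_score"]))
print(most_damaging_score(record["FATHMM_score"], higher_is_worse=False))
```

Once the values are collapsed like this, there is no way to tell which protein/transcript the surviving score came from.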

See README:

https://drive.google.com/file/d/1y8uaJE44YHBORQQ2kp0xGinuYGgD0V5e/view

What I should do is pull out the multiple entries along with the fields they correspond to (ie Uniprot_entry/Uniprot_acc/Ensembl_proteinid), and work out how to assign the correct single entry to a transcript.
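
A rough sketch of that idea, under some assumptions: the record also carries an identifier field (here `Ensembl_proteinid`) with one `&`-separated entry per value, and the transcript's protein ID is already known. The function name `value_for_protein` and the ENSP IDs in the example are invented for illustration. Which identifier column each predictor is actually indexed by needs to be checked against the README, so the code refuses to guess when the entry counts differ (as with the 3-entry MutationTaster fields above).

```python
from typing import Dict, Optional


def value_for_protein(record: Dict[str, str], field: str, protein_id: str,
                      id_field: str = "Ensembl_proteinid",
                      sep: str = "&") -> Optional[str]:
    """Return the entry of `field` at the position where `id_field` equals protein_id."""
    ids = record.get(id_field, "").split(sep)
    values = record.get(field, "").split(sep)
    if len(ids) != len(values):
        # Entry counts differ: this field is presumably indexed by another ID column
        return None
    for pid, value in zip(ids, values):
        if pid == protein_id and value != ".":
            return value
    return None


# Hypothetical record: the protein IDs are made up, the scores are from the example above
record = {
    "Ensembl_proteinid": "ENSP00000000001&ENSP00000000002&ENSP00000000003&ENSP00000000004",
    "FATHMM_score": ".&.&-2.46&.",
    "MutationTaster_score": "1&1&1",
}

print(value_for_protein(record, "FATHMM_score", "ENSP00000000003"))
print(value_for_protein(record, "MutationTaster_score", "ENSP00000000001"))
```

Returning `None` on a mismatch keeps a wrong score from being silently attached to a transcript, at the cost of more missing values.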

The implementation is difficult as:

  • Ensembl/RefSeq don't perfectly match, so RefSeq will often end up with no entries
  • I can't look up UniProt <-> Transcript easily without downloading a 20G UniProt database and making a lookup in VariantGrid

