VEP
VEP Annotation implementation details
VEP provides a merged (Ensembl + RefSeq) database, but we need the ability to pick a single entry for a variant (ie worst/canonical transcript), and since Ensembl is much larger than RefSeq, the pick is almost always an Ensembl transcript. I can't see a way to run VEP twice (eg performing --flag_pick after generating per-transcript data), so we can't get RefSeq canonical choices from the merged database.
We thus need to install either the Ensembl or the RefSeq VEP annotations, not the merged one.
Pathogenicity prediction calculations are performed per-transcript. Problem: we store the most damaging value across all transcripts rather than the specific entry for each transcript.
dbNSFP has multiple values per variant that correspond to different Uniprot_entry/Uniprot_acc/Ensembl_proteinid entries eg:
'MutationAssessor_pred': 'M&M&.&.', 'MutationAssessor_score': '2.86&2.86&.&.',
'MutationTaster_pred': 'D&D&D', 'MutationTaster_score': '1&1&1',
'Polyphen2_HVAR_pred': 'D&D&.&.', 'Polyphen2_HVAR_score': '1.0&1.0&.&.'
'FATHMM_pred': '.&.&D&.', 'FATHMM_score': '.&.&-2.46&.'
See README:
https://drive.google.com/file/d/1y8uaJE44YHBORQQ2kp0xGinuYGgD0V5e/view
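As an illustration of the problem, here is a minimal sketch of how '&'-separated values like the ones above collapse into a single "most damaging" score. The function names and the higher_is_worse parameter are hypothetical (not VariantGrid code), and the direction of each score is an assumption that should be checked against the dbNSFP README for each tool.

```python
from typing import List, Optional


def split_dbnsfp(value: str) -> List[Optional[str]]:
    """Split an '&'-separated dbNSFP field; '.' marks a missing entry."""
    return [None if v in (".", "") else v for v in value.split("&")]


def most_damaging_score(value: str, higher_is_worse: bool = True) -> Optional[float]:
    """Collapse all per-entry scores into one value (hypothetical helper).

    higher_is_worse is an assumption supplied by the caller: it depends on
    the tool, so look it up rather than relying on the default.
    """
    scores = [float(v) for v in split_dbnsfp(value) if v is not None]
    if not scores:
        return None
    return max(scores) if higher_is_worse else min(scores)
```

Collapsing like this throws away which entry (and therefore which transcript) the score came from, which is exactly the information we want to keep.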
What I should do is pull out the multiple entries together with the fields they correspond to (ie Uniprot_entry/Uniprot_acc/Ensembl_proteinid), and use those to work out how to assign the correct single entry to each transcript.
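A rough sketch of that idea, assuming the key field (here Ensembl_proteinid) comes through as an '&'-separated list in the same order as the prediction fields. That ordering is an assumption to verify against the README, and the protein IDs below are placeholders, not real accessions. Note that in the example above MutationTaster has 3 entries while the other tools have 4, so a positional match cannot be taken for granted.

```python
from typing import Dict, Optional


def per_protein_values(
    record: Dict[str, str],
    field: str,
    key_field: str = "Ensembl_proteinid",
) -> Optional[Dict[str, Optional[str]]]:
    """Pair each '&'-separated value with its key (hypothetical helper).

    Returns None when the two lists differ in length, because the
    entries then cannot be aligned by position.
    """
    keys = record[key_field].split("&")
    values = record[field].split("&")
    if len(keys) != len(values):
        return None
    return {k: (None if v == "." else v) for k, v in zip(keys, values)}


# Placeholder IDs for illustration only; scores copied from the example above.
record = {
    "Ensembl_proteinid": "ENSP_A&ENSP_B&ENSP_C&ENSP_D",
    "FATHMM_score": ".&.&-2.46&.",
    "MutationTaster_score": "1&1&1",
}

fathmm_by_protein = per_protein_values(record, "FATHMM_score")
taster_by_protein = per_protein_values(record, "MutationTaster_score")
```

With a mapping like fathmm_by_protein, the remaining step is getting from a protein ID to the transcript VEP picked, which is where the difficulties below come in.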
The implementation is difficult as:
- Ensembl/RefSeq don't perfectly match, so RefSeq transcripts will often end up with no entries
- I can't look up UniProt <-> Transcript easily without downloading a 20 GB UniProt database and building a lookup in VariantGrid