Pynnotator

This is a Python library for annotating VCF files from exomes or genomes of individuals with Mendelian disorders.

It was built using state-of-the-art tools and databases for human genome annotation.

Development

git clone https://github.com/raonyguimaraes/pynnotator
cd pynnotator
python3 -m venv venv
source venv/bin/activate
python setup.py develop
pynnotator install
pynnotator test

This is what you should get:

pynnotator test
Testing Annotation...
Running Command gunzip -c -d /home/raony/dev/pynnotator/pynnotator/tests/sample.100.vcf.gz > /home/raony/dev/pynnotator/pynnotator/tests/ann_sample.100/sample.100.vcf
2021-03-21 03:18:20.152735 Starting sanity_check:  /home/raony/dev/pynnotator/pynnotator/tests/ann_sample.100/sample.100.vcf
sort -k1,1d -k2,2n
2021-03-21 03:18:20.169532 Finished sanity_check, it took:  0:00:00.016797
2021-03-21 03:18:20.170173 Starting snpEff annotation:  sanity_check/sorted.vcf
2021-03-21 03:18:20.170460 Starting vep annotation:  sanity_check/sorted.vcf
2021-03-21 03:18:20.171451 Starting snpsift annotation:  sanity_check/sorted.vcf
2021-03-21 03:18:20.389561 Finished snpsift annotation, it took:  0:00:00.218110
2021-03-21 03:18:54.005021 Finished snpEff annotation, it took:  0:00:33.834848
2021-03-21 03:20:26.368233 Finished vep annotation, it took:  0:02:06.197773
2021-03-21 03:20:26.368687 Merging all VCF Files...
2021-03-21 03:20:26.368969 Starting merge:  sanity_check/sorted.vcf

=============================================
vcfanno version 0.3.2 [built with go1.12.1]

see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:115: found 10 sources from 3 files
vcfanno.go:156: falling back to non-bgzip
vcfanno.go:194: Info Error: CSQ not found in INFO >> this error/warning may occur many times. reporting once here...
vcfanno.go:248: annotated 45 variants in 0.00 seconds (10489.1 / second)
2021-03-21 03:20:26.416464 Finished merge, it took:  0:00:00.047495
2021-03-21 03:20:26.416888 Convert VCF to CSV...
2021-03-21 03:20:26.448489 Finished Annotation, it took 0:02:06.299032

A       A G       T G       A       A G       T G       A
| C   C | | C   C | | A   C | C   C | | C   C | | A   C |
| | T | | | | A | | | | G | | | T | | | | A | | | | G | |
| G   G | | G   G | | T   G | G   G | | G   G | | T   G |
T       T C       A C       T       T C       A C       T
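The timing lines in the log above are regular enough to summarize with a short script. The following is a minimal sketch, not part of pynnotator: the function name `step_durations` and the regular expression are assumptions based only on the log format shown here.

```python
import re
from datetime import timedelta

# Matches lines such as:
# "2021-03-21 03:18:54.005021 Finished snpEff annotation, it took:  0:00:33.834848"
# The colon after "took" is optional because the final summary line omits it.
FINISHED = re.compile(
    r"Finished (?P<step>.+?), it took:?\s+(?P<h>\d+):(?P<m>\d+):(?P<s>\d+(?:\.\d+)?)"
)


def step_durations(log_text):
    """Return {step name: timedelta} for every 'Finished ...' line in the log."""
    durations = {}
    for line in log_text.splitlines():
        match = FINISHED.search(line)
        if match:
            durations[match["step"]] = timedelta(
                hours=int(match["h"]),
                minutes=int(match["m"]),
                seconds=float(match["s"]),
            )
    return durations


# Lines copied from the test output above.
sample_log = """\
2021-03-21 03:18:20.389561 Finished snpsift annotation, it took:  0:00:00.218110
2021-03-21 03:18:54.005021 Finished snpEff annotation, it took:  0:00:33.834848
2021-03-21 03:20:26.368233 Finished vep annotation, it took:  0:02:06.197773
2021-03-21 03:20:26.448489 Finished Annotation, it took 0:02:06.299032
"""

durations = step_durations(sample_log)
# "Annotation" is the overall total, so exclude it when looking for the slowest step.
steps = {name: took for name, took in durations.items() if name != "Annotation"}
slowest = max(steps, key=steps.get)
print(slowest, steps[slowest])
```

In this run the annotators start within the same second and the total time is close to the VEP time, which suggests that VEP dominates the wall-clock time for the test sample.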

Installation

Using conda:

conda install pynnotator
pynnotator install
pynnotator -i sample.vcf

Using pip:

pip install pynnotator
pynnotator install
pynnotator -i sample.vcf

Using docker-compose:

cd compose
bash run-pynnotator-with-docker.sh

Languages

  • Perl
  • Python
  • Java
  • Go

Tools

  • vep (version 91.1)
  • snpeff (SnpEff 4.3r)
  • htslib (1.5)
  • vcftools (0.1.15)
  • vcfanno (v0.3.2)

Databases

  • 1000Genomes (Phase 3) - ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf
  • dbSNP (including clinvar) - (human_9606_b150_GRCh37p13)
  • Exome Sequencing Project - ESP6500SI-V2-SSA137.GRCh38-liftover
  • dbNSFP 3.5a (including dbscSNV 1.1)
  • Ensembl 90 (phenotype and clinically associated variants)
  • Decipher (HI_Predictions_Version3 and DDG2P)

Features

  • Annotate an exome in only 10 minutes.
  • Supports .VCF and .VCF.GZ files.
  • Installation takes about 20 minutes.
  • Efficient multithreading.
  • Annotate a VCF file using multiple VCFs as a reference.
  • Combine the best tools and databases currently available for VCF annotation.
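When post-processing results in your own scripts, plain and gzip-compressed VCFs can be read through one code path. This is a minimal sketch using only the standard library; `open_vcf` and `count_variants` are hypothetical helper names and do not describe how pynnotator itself reads its input.

```python
import gzip


def open_vcf(path):
    """Open a plain (.vcf) or gzip-compressed (.vcf.gz) file in text mode."""
    if str(path).lower().endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_variants(path):
    """Count variant records, skipping blank lines and '#' header lines."""
    with open_vcf(path) as handle:
        return sum(
            1 for line in handle if line.strip() and not line.startswith("#")
        )
```

For example, `count_variants("sample.vcf.gz")` and `count_variants("sample.vcf")` should return the same number when both files hold the same records.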

Files

.
├── 1000genomes
│   ├── ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
│   └── ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz.tbi
├── dbnsfp
│   ├── dbNSFP3.4a.txt.gz
│   ├── dbNSFP3.4a.txt.gz.tbi
│   ├── dbscSNV1.1.txt.gz
│   └── dbscSNV1.1.txt.gz.tbi
├── dbsnp
│   ├── All_20170403.vcf.gz
│   ├── All_20170403.vcf.gz.tbi
│   ├── clinvar.vcf.gz
│   └── clinvar.vcf.gz.tbi
├── decipher
│   ├── DDG2P.csv.gz
│   ├── HI_Predictions_Version3.bed.gz
│   ├── HI_Predictions_Version3.bed.gz.tbi
│   └── population_cnv.txt.gz
├── ensembl
│   ├── Homo_sapiens_clinically_associated.vcf.gz
│   ├── Homo_sapiens_clinically_associated.vcf.gz.tbi
│   ├── Homo_sapiens_phenotype_associated.vcf.gz
│   └── Homo_sapiens_phenotype_associated.vcf.gz.tbi
├── esp6500
│   ├── esp6500si.vcf.gz
│   └── esp6500si.vcf.gz.tbi
├── snpeff_data
│   └── GRCh37.75
└── vep_cache
    └── homo_sapiens
        └── 88_GRCh37

705 directories, 11839 files
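In the tree above, every `.vcf.gz` file sits next to a `.tbi` index. A quick check for missing indexes after installation could look like the sketch below. The function name `vcfs_missing_index` is hypothetical, and the location of the data directory on your system is an assumption you need to supply.

```python
from pathlib import Path


def vcfs_missing_index(data_dir):
    """Return compressed VCFs under data_dir that have no sibling .tbi index."""
    missing = []
    for vcf in sorted(Path(data_dir).rglob("*.vcf.gz")):
        index = vcf.with_name(vcf.name + ".tbi")
        if not index.exists():
            missing.append(vcf)
    return missing
```

Calling `vcfs_missing_index("path/to/data")` on a complete installation should return an empty list. Files such as `DDG2P.csv.gz` are not VCFs and are deliberately ignored.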

Examples of VCFs from patients with Mendelian Disorders

.
├── annotation.validated.vcf.gz
├── examples
│   ├── miller.vcf.gz
│   ├── NA12878.compound_heterozygous.vcf.gz
│   ├── NA12878.dominant.vcf.gz
│   ├── NA12878.recessive.vcf.gz
│   ├── NA12878.xlinked.vcf.gz
│   └── schinzel_giedion.vcf.gz
└── sample.1000.vcf

Requirements

  • Docker Compose or
  • Ubuntu 16.04 LTS or Red Hat/CentOS 7
  • Python 2 or 3

How to run it?

Requires at least 65GB of disk space during installation and 35GB once installed.
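It can be worth checking free disk space before starting the download. The sketch below uses the standard-library `shutil.disk_usage`. It assumes the 65GB figure means decimal gigabytes and that the data is installed on the same filesystem as the path you pass in. `has_enough_space` is a hypothetical helper, not a pynnotator command.

```python
import shutil

REQUIRED_GB_DURING_INSTALL = 65  # figure quoted in this README


def has_enough_space(path=".", required_gb=REQUIRED_GB_DURING_INSTALL):
    """Return True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9  # assumption: 1 GB = 10**9 bytes
    return free_gb >= required_gb


if has_enough_space():
    print("Enough free space to run 'pynnotator install'.")
else:
    print("Less than 65 GB free; installation may fail.")
```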

Method 1 (Docker Compose):

docker-compose run pynnotator -i pynnotator/tests/sample.1000.vcf
or
docker-compose run pynnotator -i sample.vcf.gz

Method 2 (native installation):

# Using Ubuntu 16.04 LTS

sudo apt-get install gcc git python3-dev zlib1g-dev make zip libssl-dev libbz2-dev liblzma-dev libcurl4-openssl-dev build-essential
python3 -m venv mendelmdenv
source mendelmdenv/bin/activate
pip install pynnotator
pynnotator install

# And then finally:
pynnotator -i sample.vcf
#or
pynnotator -i sample.vcf.gz
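Judging from the test log and the Miller example in this README, results appear to land in a directory named `ann_<input name>` containing `annotation.final.vcf`. The helper below encodes that inference. It is a sketch based only on those two examples, not on the pynnotator source, and it assumes the output directory is created next to the input file.

```python
from pathlib import Path


def expected_output(vcf_path):
    """Guess where the final annotated VCF is written for a given input."""
    path = Path(vcf_path)
    name = path.name
    # Strip the longest matching extension first so ".vcf.gz" is handled correctly.
    for suffix in (".vcf.gz", ".vcf"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    # Assumption: the "ann_" directory sits beside the input file.
    return path.parent / f"ann_{name}" / "annotation.final.vcf"


print(expected_output("tests/miller.vcf.gz"))
```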

Options

You can change the memory usage and the number of cores in settings.py.

Test

pynnotator test

Others

pynnotator install
# This downloads and installs all the libraries and data needed.
pynnotator build
# This rebuilds the whole required dataset from scratch (takes about 8 hours and requires a lot of memory).

Development

git clone https://github.com/raonyguimaraes/pynnotator
python setup.py develop
# And have fun!

Annotations you can get from dbNSFP

Major sources:

    Variant determination:
            Gencode release 22/Ensembl 79, released March, 2015 (hg38)
    Functional predictions:
            SIFT ensembl 66, released Jan, 2015 http://provean.jcvi.org/index.php
            PROVEAN 1.1 ensembl 66, released Jan, 2015 http://provean.jcvi.org/index.php
            Polyphen-2 v2.2.2, released Feb, 2012 http://genetics.bwh.harvard.edu/pph2/
            LRT, released November, 2009 http://www.genetics.wustl.edu/jflab/lrt_query.html
            MutationTaster 2, data retrieved in 2015 http://www.mutationtaster.org/
            MutationAssessor, release 3 http://mutationassessor.org/
            FATHMM, v2.3 http://fathmm.biocompute.org.uk
            fathmm-MKL, http://fathmm.biocompute.org.uk/fathmmMKL.htm
            CADD, v1.3 http://cadd.gs.washington.edu/
            VEST, v3.0 http://karchinlab.org/apps/appVest.html
            fitCons, v1.01 http://compgen.bscb.cornell.edu/fitCons/
            DANN, https://cbcl.ics.uci.edu/public_data/DANN/
            MetaSVM and MetaLR, doi: 10.1093/hmg/ddu733
            GenoCanyon, v1.0.3 http://genocanyon.med.yale.edu/index.html
            Eigen & Eigen PC, v1.1 http://www.columbia.edu/~ii2135/eigen.html
            M-CAP, v1.0 http://bejerano.stanford.edu/MCAP/
            REVEL, https://sites.google.com/site/revelgenomics/
            MutPred, v1.2 http://mutpred.mutdb.org/
    Conservation scores:
            phyloP100way_vertebrate (hg38) http://hgdownload.soe.ucsc.edu/goldenPath/hg38/phyloP100way/
            phyloP20way_mammalian (hg38) http://hgdownload.soe.ucsc.edu/goldenPath/hg38/phyloP20way/
            phastCons100way_vertebrate (hg38) http://hgdownload.soe.ucsc.edu/goldenPath/hg38/phastCons100way/
            phastCons20way_mammalian (hg38) http://hgdownload.soe.ucsc.edu/goldenPath/hg38/phastCons20way/
            GERP++ http://mendel.stanford.edu/SidowLab/downloads/gerp/
            SiPhy http://www.broadinstitute.org/mammals/2x/siphy_hg19/
    Other variant annotation sources:
            Interpro v56 http://www.ebi.ac.uk/interpro/
            1000 Genomes project http://www.1000genomes.org/
            ESP http://evs.gs.washington.edu/EVS/
            dbSNP 147 (hg38) ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh38p2/VCF/All_20160527.vcf.gz
            clinvar 20161101 (hg38) ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20161101.vcf.gz
            ExAC v0.3 http://exac.broadinstitute.org/
            UK10K COHORT http://www.uk10k.org/studies/cohorts.html
            Ancestral alleles (hg38) ftp://ftp.ensembl.org/pub/release-84/fasta/ancestral_alleles
            Altai Neanderthal genotypes: http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/
            Denisova genotypes: http://www.eva.mpg.de/denisova
            RSRS http://dx.doi.org/10.1016/j.ajhg.2012.03.002
            GTEx v6 http://www.gtexportal.org/static/datasets/gtex_analysis_v6/single_tissue_eqtl_data/
    Other gene annotation sources:
            HGNC, downloaded on March 15, 2016
            Uniprot, released 2016_2
            IntAct, downloaded on March 15, 2016
            GWAS catalog, downloaded on March 15, 2015
            egenetics and GNF/Atlas expression data, downloaded from BioMart on Oct. 1, 2013
            BioGRID, version 3.4.134
            Haploinsufficiency probability data, from doi:10.1371/journal.pgen.1001154
            Recessive probability data, from DOI:10.1126/science.1215040
            Residual Variation Intolerance Score (RVIS), from http://genic-intolerance.org/
            GO, downloaded on March 15, 2016
            ConsensusPathDB, Release 31
            Essential genes, based on doi:10.1371/journal.pgen.1003484
            Mouse genes, from ftp://ftp.informatics.jax.org/pub/reports/index.html on March 15, 2016
            Zebra fish genes, from http://zfin.org/downloads/pheno.txt on March 15, 2016
            KEGG pathway, from http://www.openbioinformatics.org/gengen/tutorial_calculate_gsea.html
            BioCarta pathway, from http://www.openbioinformatics.org/gengen/tutorial_calculate_gsea.html
            GTEx v6 http://www.gtexportal.org/static/datasets/gtex_analysis_v6/rna_seq_data/
            GDI doi: 10.1073/pnas.1518646112
            LoFtool: joao.fadista@med.lu.se
            SORVA: doi: 10.1101/103218

Annotation example

cd tests
pynnotator -i miller.vcf.gz
grep 'Miller' ann_miller/annotation.final.vcf

16      72050942        rs267606766     G       A       287.41  PASS    AC=1;AF=0.50;AN=2;BaseQRankSum=2.237;DB;DP=13;Dels=0.00;FS=5.119;HRun=0;HaplotypeScore=0.0000;MQ0=0;MQ=60.00;MQRankSum=0.231;QD=22.11;ReadPosRankSum=-0.077;set=variant2;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Ggg/Agg|G152R|395|DHODH|protein_coding|CODING|ENST00000219240|4|A);CSQ=A|missense_variant|MODERATE|DHODH|ENSG00000102967|Transcript|ENST00000219240|protein_coding|4/9||||475/2065|454/1188|152/395|G/R|Ggg/Agg|||1||HGNC|2867|deleterious(0)|probably_damaging(1);SNP;HET;VARTYPE=SNP;HI_PREDICTIONS=DHODH|0.325470662|25.78%;dbsnp.RS=267606766;dbsnp.RSPOS=72050942;dbsnp.dbSNPBuildID=137;dbsnp.SSR=0;dbsnp.SAO=1;dbsnp.VP=0x050268000a05040002110100;dbsnp.GENEINFO=DHODH:1723;dbsnp.WGT=1;dbsnp.VC=SNV;dbsnp.PM;dbsnp.PMC;dbsnp.S3D;dbsnp.NSM;dbsnp.REF;dbsnp.ASP;dbsnp.VLD;dbsnp.LSD;dbsnp.OM;clinvar.RS=267606766;clinvar.RSPOS=72050942;clinvar.dbSNPBuildID=137;clinvar.SSR=0;clinvar.SAO=1;clinvar.VP=0x050268000a05040002110100;clinvar.GENEINFO=DHODH:1723;clinvar.WGT=1;clinvar.VC=SNV;clinvar.PM;clinvar.PMC;clinvar.S3D;clinvar.NSM;clinvar.REF;clinvar.ASP;clinvar.VLD;clinvar.LSD;clinvar.OM;clinvar.CLNALLE=1;clinvar.CLNHGVS=NC_000016.9:g.72050942G>A;clinvar.CLNSRC=OMIM_Allelic_Variant|UniProtKB_(protein);clinvar.CLNORIGIN=1;clinvar.CLNSRCID=126064.0004|Q02127#VAR_062414;clinvar.CLNSIG=5;clinvar.CLNDSDB=MedGen:OMIM:SNOMED_CT;clinvar.CLNDSDBID=C0265257:263750:66038001;clinvar.CLNDBN=Miller_syndrome;clinvar.CLNREVSTAT=no_criteria;clinvar.CLNACC=RCV000018294.28;esp6500.DBSNP=dbSNP_138;esp6500.EA_AC=1,8301;esp6500.AA_AC=0,3878;esp6500.TAC=1,12179;esp6500.MAF=0.012,0.0,0.0082;esp6500.GTS=AA,AG,GG;esp6500.EA_GTC=0,1,4150;esp6500.AA_GTC=0,0,1939;esp6500.GTC=0,1,6089;esp6500.DP=130;esp6500.GL=DHODH;esp6500.CP=0.8;esp6500.CG=5.8;esp6500.AA=G;esp6500.CA=.;esp6500.EXOME_CHIP=no;esp6500.GWAS_PUBMED=.;esp6500.FG=NM_001361.4:missense;esp6500.HGVS_CDNA_VAR=NM_001361.4:c.454G>A;esp6500.HGVS_PROTEIN_VAR=NM_001361.4:p.(G152R);esp6500.CDS_SIZES=NM_001361.4:1188;esp6500.GS=125;esp6500.PH=probably-damaging:1.0;esp6500.EA_AGE=.;esp6500.AA_AGE=.;esp6500.GRCh38_POSITION=16:72017043 GT:AD:DP:GQ:PL  0/1:4,9:13:99:317,0,101
16      72055110        rs267606767     G       C       287.41  PASS    AC=1;AF=0.50;AN=2;BaseQRankSum=2.237;DB;DP=13;Dels=0.00;FS=5.119;HRun=0;HaplotypeScore=0.0000;MQ0=0;MQ=60.00;MQRankSum=0.231;QD=22.11;ReadPosRankSum=-0.077;set=variant2;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gGc/gCc|G202A|395|DHODH|protein_coding|CODING|ENST00000219240|5|C);CSQ=C|missense_variant|MODERATE|DHODH|ENSG00000102967|Transcript|ENST00000219240|protein_coding|5/9||||626/2065|605/1188|202/395|G/A|gGc/gCc|||1||HGNC|2867|tolerated(0.18)|possibly_damaging(0.893);SNP;HET;VARTYPE=SNP;HI_PREDICTIONS=DHODH|0.325470662|25.78%;dbsnp.RS=267606767;dbsnp.RSPOS=72055110;dbsnp.dbSNPBuildID=137;dbsnp.SSR=0;dbsnp.SAO=1;dbsnp.VP=0x050268000a05040002110100;dbsnp.GENEINFO=DHODH:1723;dbsnp.WGT=1;dbsnp.VC=SNV;dbsnp.PM;dbsnp.PMC;dbsnp.S3D;dbsnp.NSM;dbsnp.REF;dbsnp.ASP;dbsnp.VLD;dbsnp.LSD;dbsnp.OM;dbsnp.TOPMED=0.999828,0.000171715,.;clinvar.RS=267606767;clinvar.RSPOS=72055110;clinvar.dbSNPBuildID=137;clinvar.SSR=0;clinvar.SAO=1;clinvar.VP=0x050268000a05040002110100;clinvar.GENEINFO=DHODH:1723;clinvar.WGT=1;clinvar.VC=SNV;clinvar.PM;clinvar.PMC;clinvar.S3D;clinvar.NSM;clinvar.REF;clinvar.ASP;clinvar.VLD;clinvar.LSD;clinvar.OM;clinvar.CLNALLE=1,2;clinvar.CLNHGVS=NC_000016.9:g.72055110G>A,NC_000016.9:g.72055110G>C;clinvar.CLNSRC=OMIM_Allelic_Variant|UniProtKB_(protein),OMIM_Allelic_Variant|UniProtKB_(protein);clinvar.CLNORIGIN=1,1;clinvar.CLNSRCID=126064.0006|Q02127#VAR_062417,126064.0005|Q02127#VAR_062416;clinvar.CLNSIG=5,5;clinvar.CLNDSDB=MedGen:OMIM:SNOMED_CT,MedGen:OMIM:SNOMED_CT;clinvar.CLNDSDBID=C0265257:263750:66038001,C0265257:263750:66038001;clinvar.CLNDBN=Miller_syndrome,Miller_syndrome;clinvar.CLNREVSTAT=no_criteria,no_criteria;clinvar.CLNACC=RCV000018296.27,RCV000018295.27   GT:AD:DP:GQ:PL       0/1:4,9:13:99:317,0,101
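Each annotation source adds prefixed keys (`dbsnp.`, `clinvar.`, `esp6500.`) to the INFO column, so the records above can be unpacked with ordinary string handling. The sketch below is illustrative only: `parse_info` and `clinvar_summary` are hypothetical names, the example record is a shortened copy of the first Miller record, and it assumes the real file is tab-separated as the VCF format requires.

```python
def parse_info(info_field):
    """Split a VCF INFO column into a dict; flags without '=' map to True."""
    info = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def clinvar_summary(vcf_line):
    """Pull the position, rsID and ClinVar disease name out of one record."""
    columns = vcf_line.rstrip("\n").split("\t")
    info = parse_info(columns[7])  # INFO is the eighth VCF column
    return {
        "chrom": columns[0],
        "pos": int(columns[1]),
        "id": columns[2],
        "gene": info.get("esp6500.GL"),
        "disease": info.get("clinvar.CLNDBN"),
        "clnsig": info.get("clinvar.CLNSIG"),
    }


# Shortened version of the first record above (most INFO keys omitted).
record = (
    "16\t72050942\trs267606766\tG\tA\t287.41\tPASS\t"
    "AC=1;DB;clinvar.CLNSIG=5;clinvar.CLNDBN=Miller_syndrome;esp6500.GL=DHODH\t"
    "GT:AD:DP:GQ:PL\t0/1:4,9:13:99:317,0,101"
)
summary = clinvar_summary(record)
print(summary)
```

A plain dict keeps the sketch dependency-free. For multi-allelic records such as the second one above, values like `clinvar.CLNDBN` are comma-separated and would need a further split.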
