VCF VarSelect

This is a Python package for selecting whole-exome sequencing variants from VCF files. VCF parsing is based on VCF Parser. A paper is coming soon.

Installation

git clone http://github.com/DanyangLi107/vcf_varselect.git
cd vcf_varselect
pip3 install .

File preparation

vcf file: a text file with the extension .vcf or .vcf.gz. A small example is provided in /example/example.vcf

gene_file: a list of disorder-related genes. An example is provided in /example/gene_list.csv

gender_file: a list of male samples. An example is provided in /example/male_list.txt

innerfreq_file: a file of variant inner frequencies (inner-freq) computed from all samples. An example is provided in /example/inner_freq.json
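Before running any selection, it can help to confirm that the VCF input opens and contains records. The following is a minimal sketch using only the Python standard library, independent of vcf_varselect. The helper names `open_vcf` and `count_records` are hypothetical, and the only format assumption is the VCF convention that header lines start with `#`.

```python
import gzip
from pathlib import Path


def open_vcf(path):
    """Open a plain (.vcf) or gzip-compressed (.vcf.gz) file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def count_records(path):
    """Count data lines, skipping blank lines and '#' header lines."""
    with open_vcf(path) as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )
```

Calling `count_records` on a VCF path returns the number of variant lines, which gives a quick sanity check against the number of variants the package reports.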

Basic function

Return a dictionary of selected variants from one sample VCF file

Total variants in VCF:

from vcf_varselect import VariantSelection
vcf = VariantSelection(infile='file.vcf')
for sample in vcf:
    print(vcf.variant[sample])

VCF file information:

vcf.header         # header information in the VCF
vcf.id_dict        # INFO, FORMAT and FILTER information in the VCF
vcf.vep_columns    # VEP annotation information in the VCF
vcf.sample         # sample ID in the VCF

Select variants with good quality:

quality = vcf.quality_selection(FILTER='PASS', DP=10.0, QD=2.0, MQ=40.0)
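To make the thresholds concrete, here is a rough sketch of the kind of filter these parameters describe. It is not the package's implementation. The function name `passes_quality`, the flat-dict variant representation, and the assumption that each numeric value acts as a minimum (a variant is kept when its value is at least the threshold) are all illustrative.

```python
def passes_quality(variant, FILTER="PASS", DP=10.0, QD=2.0, MQ=40.0):
    """Return True if a variant meets every quality threshold.

    `variant` is a hypothetical dict with a 'FILTER' string and numeric
    'DP', 'QD' and 'MQ' values; missing numeric fields count as failing.
    """
    if variant.get("FILTER") != FILTER:
        return False
    minimums = {"DP": DP, "QD": QD, "MQ": MQ}
    for key, minimum in minimums.items():
        value = variant.get(key)
        if value is None or float(value) < minimum:
            return False
    return True


# Made-up records, for illustration only.
variants = [
    {"FILTER": "PASS", "DP": 35, "QD": 12.4, "MQ": 60.0},
    {"FILTER": "PASS", "DP": 6, "QD": 12.4, "MQ": 60.0},      # depth too low
    {"FILTER": "LowQual", "DP": 35, "QD": 12.4, "MQ": 60.0},  # FILTER mismatch
]
kept = [v for v in variants if passes_quality(v)]
print(len(kept))  # 1
```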

Select rare variants:

freq = vcf.freq_selection(KG=0.001, EXAC=0.001, GNOMAD=0.001, SWEGEN=0.001, innerfreqfile=innerfreq_file)
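The frequency cut-offs can be pictured with a similar sketch. Again this is an illustration under stated assumptions rather than the package's own logic: `is_rare` is a hypothetical helper, frequencies are assumed to be stored per database name, a variant counts as rare when no reported frequency exceeds its threshold, and a missing frequency is treated as "not observed".

```python
def is_rare(frequencies, thresholds):
    """Return True if no reported frequency exceeds its threshold.

    Both arguments are hypothetical dicts keyed by database name.
    A missing or None frequency is treated as not observed, i.e. rare.
    """
    for database, limit in thresholds.items():
        freq = frequencies.get(database)
        if freq is not None and float(freq) > limit:
            return False
    return True


thresholds = {"KG": 0.001, "EXAC": 0.001, "GNOMAD": 0.001, "SWEGEN": 0.001}
print(is_rare({"KG": 0.0002, "GNOMAD": 0.0005}, thresholds))  # True
print(is_rare({"KG": 0.0002, "EXAC": 0.02}, thresholds))      # False
```

Whether missing frequencies should pass or fail is a design choice; treating them as rare keeps novel variants, which is usually the intent in rare-variant studies.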

Select damaging, loss-of-function and missense variants:

damaging, lof, mis_damage = vcf.damaging_selection(criteria=['SIFT', 'POLYPHEN', 'MPC', 'CADD', 'SPIDEX', 'PHYLOP'])

Select good-quality rare damaging, rare loss-of-function and rare missense variants:

damaging_var, lof_var, mis_var = vcf.comb_selection(FILTER='PASS', DP=10.0, QD=2.0, MQ=40.0,
                                                    KG=0.001, EXAC=0.001, GNOMAD=0.001, SWEGEN=0.001, innerfreqfile=innerfreq_file,
                                                    criteria=['SIFT', 'POLYPHEN', 'MPC', 'CADD', 'SPIDEX', 'PHYLOP'])
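Conceptually, the combined selection keeps variants that satisfy the quality, frequency and damage criteria together. One way to picture this, which is not necessarily how `comb_selection` works internally, is as an intersection of per-criterion result dictionaries. The helper `intersect_selections` and the variant keys below are hypothetical.

```python
def intersect_selections(*selections):
    """Keep only the entries whose key appears in every selection dict."""
    if not selections:
        return {}
    first, *rest = selections
    shared = set(first).intersection(*(set(s) for s in rest))
    # Preserve the ordering and values of the first selection.
    return {key: first[key] for key in first if key in shared}


# Made-up keys and values, for illustration only.
good_quality = {"1:100:A:G": "v1", "1:200:C:T": "v2", "2:50:G:A": "v3"}
rare = {"1:100:A:G": "v1", "2:50:G:A": "v3"}
damaging = {"2:50:G:A": "v3", "3:70:T:C": "v4"}

print(intersect_selections(good_quality, rare, damaging))  # {'2:50:G:A': 'v3'}
```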

Select variants in disorder-related genes:

from vcf_varselect import match_gene
gene_var = match_gene(damaging_var, gene_file, gender_file)

Return a dataframe of disorder-related rare damaging variants from multiple samples

from vcf_varselect import sample_combine
df = sample_combine(dir, innerfreq_file, gene_file, gender_file,
                    FILTER='PASS', DP=10.0, QD=2.0, MQ=40.0,
                    KG=0.001, EXAC=0.001, GNOMAD=0.001, SWEGEN=0.001,
                    criteria=['SIFT', 'POLYPHEN', 'MPC', 'CADD', 'SPIDEX', 'PHYLOP'])