GitHub - EpistasisLab/VEPDB_populator: Population utilities for the VEPDB distributed annotation database, with an annotator written in Python

VEPDB_populator

Brian S. Cole PhD, Dichen Li MCIT, Zhengxuan Wu, and Yingjie Luan

Institute for Biomedical Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia PA

This directory contains Python source to populate a Cassandra database with genetic variant effect annotations from Ensembl Variant Effect Predictor (VEP) output in VCF format (VEP VCF), in which annotation information is stored in CSQ strings.

db_populator.py: parallel Cassandra database population built on the DataStax Python driver and the multiprocessing library. Input is a VEP VCF file.

An example input line (only the INFO column is displayed):

CSQ=A|downstream_gene_variant|MODIFIER|KLHL17|ENSG00000187961|Transcript|ENST00000463212|retained_intron|||||||||||4136|1|HGNC|24023|1|2|3|4
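As a rough illustration of how one CSQ string breaks down into fields, here is a minimal sketch. It is not the code in db_populator.py. The function name `parse_csq_entry` and the eight field names are assumptions chosen to match the values visible in the example above. The real field order is declared in the VEP VCF header and can differ between VEP runs, so the remaining fields are kept positional.

```python
# Hypothetical names for the first eight CSQ fields, guessed from the example
# values. The authoritative order is the one declared in the VEP VCF header.
ASSUMED_LEADING_FIELDS = (
    "allele", "consequence", "impact", "symbol",
    "gene", "feature_type", "feature", "biotype",
)


def parse_csq_entry(entry):
    """Split one pipe-delimited CSQ annotation into a dict.

    The first eight values are named; everything after them is returned
    unnamed under the "remaining" key, in the original order.
    """
    values = entry.split("|")
    named = dict(zip(ASSUMED_LEADING_FIELDS, values))
    named["remaining"] = values[len(ASSUMED_LEADING_FIELDS):]
    return named


info = ("CSQ=A|downstream_gene_variant|MODIFIER|KLHL17|ENSG00000187961|"
        "Transcript|ENST00000463212|retained_intron|||||||||||4136|1|HGNC|"
        "24023|1|2|3|4")

# Drop the "CSQ=" key before splitting the value on "|".
annotation = parse_csq_entry(info[len("CSQ="):])
print(annotation["symbol"], annotation["consequence"])
```

Empty strings in the output correspond to fields that VEP left blank for this annotation.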

Comma-separated annotations are each handled separately, and a collection of the annotations is built.

An example input line with multiple annotations:

1 901994 G A CSQ=A|downstream_gene_variant|MODIFIER|KLHL17|ENSG00000187961|Transcript|ENST00000463212|retained_intron|||||||||||4136|1|HGNC|24023||||,A|upstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000480267|retained_intron|||||||||||4261|1|HGNC|25284||||

Multiple annotations can arise from multiple genes (as shown in this example: KLHL17 and PLEKHN1 are separate, comma-separated annotations), multiple isoforms of the same gene, or polyallelic variants (two or more ALT alleles) which may be layered additionally on top of multiple genes/isoforms.
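The comma-splitting described above can be sketched as follows. This is an illustration under assumptions, not the repository's implementation. The helper name `split_csq` is hypothetical, the gene symbol is assumed to sit at position 3 as in the example, and the abbreviated five-column line is split on whitespace only because that is how the example is displayed. A full VCF line is tab-separated and carries INFO as its eighth column.

```python
def split_csq(info):
    """Return every CSQ annotation in an INFO value as a list of field lists.

    INFO entries are separated by ";", annotations within CSQ by ",",
    and fields within one annotation by "|".
    """
    for item in info.split(";"):
        if item.startswith("CSQ="):
            return [entry.split("|") for entry in item[len("CSQ="):].split(",")]
    return []  # no CSQ key present


line = ("1 901994 G A CSQ=A|downstream_gene_variant|MODIFIER|KLHL17|"
        "ENSG00000187961|Transcript|ENST00000463212|retained_intron|||||||||||"
        "4136|1|HGNC|24023||||,A|upstream_gene_variant|MODIFIER|PLEKHN1|"
        "ENSG00000187583|Transcript|ENST00000480267|retained_intron|||||||||||"
        "4261|1|HGNC|25284||||")

# The example shows only five columns: CHROM, POS, REF, ALT, INFO.
chrom, pos, ref, alt, info = line.split()
annotations = split_csq(info)

# Position 3 holds the gene symbol in this example (an assumption in general).
symbols = [fields[3] for fields in annotations]
print(len(annotations), symbols)
```

For a polyallelic variant, the first field of each annotation (the allele) would indicate which ALT allele the annotation belongs to. In this example both annotations refer to the single ALT allele A.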

Note:

Running this script requires a running Cassandra node or cluster, supplied as one or more contact points (IP addresses). The node or cluster must be up (nodetool status reports UN, "up-normal"), configured to accept connections over the ports Cassandra requires (for example, 9042 and 9160), and must already have the keyspace, table, and user-defined type declared. We provide a CQL script, create_table.cql, which you can source from cqlsh on a running Cassandra node to automate the creation of the vepdb_keyspace keyspace, the vepdb table, and the annotation user-defined type.
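Before running the populator, it can help to confirm that a contact point accepts TCP connections on the expected ports. The following is a minimal sketch using only the Python standard library. The helper name `port_is_open` and the contact point 127.0.0.1 are assumptions for illustration. A successful connection shows only that the port is reachable. It does not show that the node is in UN status or that the keyspace, table, and user-defined type exist.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    contact_point = "127.0.0.1"  # hypothetical contact point; use your node's IP
    for port in (9042, 9160):
        status = "open" if port_is_open(contact_point, port) else "closed"
        print(f"{contact_point}:{port} is {status}")
```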

Repository files:

t
.gitignore
LICENSE
README.md
annotation_vcf_from_cassandraDB.py
configure_cassandra.py
create_table.cql
create_table.cql.bak
csv_populator.py
db_populator.py
db_query.py
gen_type.py
install_cassandra.sh
new_schema.cql
populate_cassandra_gzip.py
populate_cassandra_gzip.py.old
simple_populator.py
type.csv
vep_vcf_to_csv.py
