HIPS: Hiding In Plain Site
Unannotated structural variants in public genomic data sets
Rebecca Torene, Alexandrea Stylianou, Jitong Cai, Ariel Gershman, Alexandra Weber, and Nancy Hansen
With thanks to former UMUC Bioinformatics master program students:
- Mary Schramko
- Upasana Pandey
- Eric Keller
- Mike Fox
- Al Koroma
- David Jalali
Public data, such as gnomAD, makes VCFs of short nucleotide variants (SNVs) available. There is anecdotal evidence that some apparent “insertions” are actually part of larger structural variants
Use VCFs of SNVs to call structural variants. In doing so, can we identify SVs that have otherwise escaped detection.
Large, aggregate, genomics data sets of short nucleotide variants (SNVs) such as gnomAD/ExAC are an invaluable resource to the research and clinical genetics communities. Such data sets are often used as a control to determine the frequencies of variants in unaffected populations and gnomAD has recently released a structural variant call set on a subset of their samples. There is evidence, however, that additional SVs exist in the data, but haven't yet been annotated. For example, this insertion is actually indicative of a known deletion. Such anecdotal findings of deletions, processed pseudogenes, inversions, tandem duplications, and mobile element insertions exist in publicly available data, hiding there in plain sight.
Without individual alignment files, the broader genomics community cannot evaluate structural variants in the complete data set. We, therefore, developed HIPS, a structural variant caller that uses VCF of SNVs as input and then outputs structural variant calls in bed format. HIPS starts from VCF, parses insertion sequences and converts to fastA, aligns by BLAST, and filters BLAST results for possible structural variants. Running the snakefile then merges similar calls and adds annotation. The Shiny app takes gnomAD data processed by HIPS and provides a web interface.
Suggested Future Improvements:
This project may be on GitHub and it may have a great name (HIPS), but we cannot deny the fact that this is the product of a 3-day hackathon and that it is imperfect. We welcome the GitHub community to make improvements and suggest the following as future improvements:
- fastaToBed.py can be modified to split VCFs by position (already splits by chromosome) so that parallelization is more efficient
- fastaToBed.py arguments could be refined (e.g., meiOnly could be boolean flag instead of string input)
- stitch together fastaToBed.py and snakefile to have a seamless pipeline from VCF to annotated SV calls
- Add more flexibility for file names, species, etc.
- Option to output SV calls as VCF (currently outputs BED)
- Add a low quality flag for SVs based on bitscore and based on overlap of breakpoints with simple repeats
Results from applying to gnomAD SNV VCFs:
Once Rstudio is installed, import:
shiny dplyr shinyWidgets devtools
To host the shiny app we will use:
For back end development - follow the steps below to create your environment and install an editor.
I prefer the free version of Pycharm, found here
- Download miniconda and choose Python 3.7 64-bit installer.
After downloading it make sure to open a new terminal and test with `which conda`.
MDMBASTYLIANOU:~ astylianou900045$ which conda /Users/astylianou900045/miniconda3/bin/conda
- Add conda channels - do this to get all the packages needed for this project:
conda config --add channels defaults conda config --add channels bioconda conda config --add channels conda-forge
Then check that everything installed correctly with: conda config --show channels
MDMBASTYLIANOU:~ astylianou900045$ conda config --show channels channels: - conda-forge - bioconda - defaults
Now that conda is installed, clone the repo
3. git clone https://github.com/NCBI-Hackathons/Hiding-in-plain-sight-unannotated-structural-variants-in-public-genomic-data-sets.git
Cloning will make a new directory. After it is cloned, enter that directory with 'cd'
4. cd Hiding-in-plain-sight-unannotated-structural-variants-in-public-genomic-data-sets
Time to create and activate the environment for this project:
conda env create -f environment.yml
conda activate sv_env
You will know that it was activated because sv_env will show in your command prompt, for example:
(sv_env) MDMBASTYLIANOU:~ astylianou900045$
- Run test data in MEI-only mode. Should create a bed file with MEI calls.
python /path/to/repo/vcfToBed.py \ -inFile /path/to/repo/test/test.vcf.gz \ -outFile out.fa -chr 21 \ -dir /path/to/repo/resources -meiOnly True
-inFileis a gzip'ed VCF with SNV calls in hg19.
-outFileis the name of the output for the fastA
-chris the chromosome to work on (without leading 'chr')
-diris the directory where BLAST dictionaries are located. The MEI BLAST dictionary is located in the
resourcesdirectory of this repo
-meiOnlyis whether to detect only MEI structural variants.
Running on other SV types (deletions, inversions, and tandem duplications)
Download human reference sequences for each chromosome for hg19 ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/README.txt
Make BLAST database for each chromosome
for f in *.fa.gz; do gunzip -k $f; done for f in *.fa; do makeblastdb -in $f -dbtype nucl; done
Make sure your MEI BLAST database and chromosome BLAST databases are in the same directory
Run test data
python /path/to/repo/vcfToBed.py \ -inFile /path/to/repo/test/test.vcf.gz \ -outFile out.fa -chr 21 \ -dir /path/to/repo/resources -meiOnly ""