Merge branch 'release/v3.2.0'
ACEnglish committed Apr 1, 2022
2 parents de2a15f + 37b6c37 commit 1d2f42b
Showing 107 changed files with 6,598 additions and 288 deletions.
8 changes: 4 additions & 4 deletions Dockerfile
@@ -10,10 +10,10 @@ RUN apt-get -qq update && apt-get install -yq \
ADD . /opt/truvari-source
WORKDIR /opt/truvari-source

RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install setproctitle pylint anybadge coverage
RUN python3 -m pip install --upgrade setuptools
RUN python3 -m pip install ./
RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install setproctitle pylint anybadge coverage && \
    python3 -m pip install --upgrade setuptools && \
    python3 -m pip install ./

WORKDIR /data

13 changes: 7 additions & 6 deletions README.md
@@ -10,12 +10,14 @@
[![pylint](imgs/pylint.svg)](https://github.com/spiralgenetics/truvari/actions/workflows/pylint.yml)
[![FuncTests](https://github.com/spiralgenetics/truvari/actions/workflows/func_tests.yml/badge.svg?branch=develop&event=push)](https://github.com/spiralgenetics/truvari/actions/workflows/func_tests.yml)
[![coverage](imgs/coverage.svg)](https://github.com/spiralgenetics/truvari/actions/workflows/func_tests.yml)
[![develop](https://img.shields.io/github/commits-since/spiralgenetics/truvari/v3.0.0)](https://github.com/spiralgenetics/truvari/commits/develop)
[![develop](https://img.shields.io/github/commits-since/spiralgenetics/truvari/v3.1.0)](https://github.com/spiralgenetics/truvari/commits/develop)
[![Downloads](https://pepy.tech/badge/truvari)](https://pepy.tech/project/truvari)

Structural variant toolkit for benchmarking, annotating and more for VCFs
Toolkit for benchmarking, merging, and annotating Structural Variants

[WIKI page](https://github.com/spiralgenetics/truvari/wiki) has detailed documentation.
See [Updates](https://github.com/spiralgenetics/truvari/wiki/Updates) on new versions.
[WIKI page](https://github.com/spiralgenetics/truvari/wiki) has detailed documentation.
See [Updates](https://github.com/spiralgenetics/truvari/wiki/Updates) on new versions.
Read our [Paper](https://doi.org/10.1101/2022.02.21.481353) for more details.

## Installation
Truvari uses Python 3.6+ and can be installed with pip:
@@ -39,10 +41,9 @@ The current most common Truvari use case is for structural variation benchmarking
- [anno](https://github.com/spiralgenetics/truvari/wiki/anno) - Add SV annotations to a VCF
- [vcf2df](https://github.com/spiralgenetics/truvari/wiki/vcf2df) - Turn a VCF into a pandas DataFrame
- [consistency](https://github.com/spiralgenetics/truvari/wiki/consistency) - Consistency report between multiple VCFs
- [divide](https://github.com/ACEnglish/truvari/wiki/divide) - Divide a VCF into independent parts
- [segment](https://github.com/spiralgenetics/truvari/wiki/segment) - Normalization of SVs into disjointed genomic regions

## More Information

Find more details and discussions about Truvari on the [WIKI page](https://github.com/spiralgenetics/truvari/wiki).

[https://www.spiralgenetics.com](https://www.spiralgenetics.com)
16 changes: 16 additions & 0 deletions docs/api/.readthedocs.yaml
@@ -0,0 +1,16 @@
# File: .readthedocs.yaml

version: 2

build:
  os: "ubuntu-20.04"
  tools:
    python: "3.9"

# Build from the docs/ directory with Sphinx
sphinx:
  configuration: docs/api/conf.py
# Explicitly set the version of Python and its requirements
python:
  install:
    - requirements: docs/requirements.txt
2 changes: 1 addition & 1 deletion docs/api/conf.py
@@ -56,7 +56,7 @@
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_static_path = []


# -- Extension configuration -------------------------------------------------
2 changes: 1 addition & 1 deletion docs/api/index.rst
@@ -11,7 +11,7 @@ This documentation is aimed at developers looking to reuse truvari's code.
For those looking to use truvari as a tool for analysis, full documentation
is available via the github page

https://github.com/spiralgenetics/truvari
https://github.com/ACEnglish/truvari

.. toctree::
:maxdepth: 2
4 changes: 3 additions & 1 deletion docs/api/local_build.sh
@@ -2,5 +2,7 @@
# requires pip install sphinx & sphinx_rtd_theme
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
# will overwrite your python
python3 $DIR/../../setup.py install
cd ../../
python3 setup.py install
cd -
sphinx-build $DIR $DIR/output $DIR/truvari.*
20 changes: 14 additions & 6 deletions docs/api/truvari.rst
@@ -42,10 +42,6 @@ SV

VariantRecord Methods
---------------------
copy_entry
^^^^^^^^^^
.. autofunction:: copy_entry

entry_boundaries
^^^^^^^^^^^^^^^^
.. autofunction:: entry_boundaries
@@ -90,8 +86,8 @@ entry_size_similarity
^^^^^^^^^^^^^^^^^^^^^
.. autofunction:: entry_size_similarity

entry_to_key
^^^^^^^^^^^^
entry_to_haplotype
^^^^^^^^^^^^^^^^^^
.. autofunction:: entry_to_haplotype

entry_to_key
@@ -112,6 +108,18 @@ bed_ranges
^^^^^^^^^^
.. autofunction:: bed_ranges

build_anno_tree
^^^^^^^^^^^^^^^
.. autofunction:: build_anno_tree

calc_af
^^^^^^^
.. autofunction:: calc_af

calc_hwe
^^^^^^^^
.. autofunction:: calc_hwe

create_pos_haplotype
^^^^^^^^^^^^^^^^^^^^
.. autofunction:: create_pos_haplotype
13 changes: 13 additions & 0 deletions docs/requirements.txt
@@ -0,0 +1,13 @@
sphinx==4.2.0
sphinx_rtd_theme==1.0.0
readthedocs-sphinx-search==0.1.1
python-Levenshtein==0.12.2
edlib>=1.3.8.post2
progressbar2>=3.41.0
pysam>=0.15.2
intervaltree>=3.0
joblib>=1.0.1
numpy>=1.21.2
pytabix>=0.1
bwapy>=0.1.4
pandas>=1.3.
31 changes: 31 additions & 0 deletions docs/v3.2.0/Citations.md
@@ -0,0 +1,31 @@
# Citing Truvari

Pre-print on bioRxiv while in submission:

Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity
doi: https://doi.org/10.1101/2022.02.21.481353

# Citations

List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari). Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list.
* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8)
* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract)
* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true)
* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728)
* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full)
* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0)
* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078)
* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5)
* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract)
* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0)
* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9)
* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452)
* [SVIM-asm: structural variant detection from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true)
* [Readfish enables targeted nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x)
* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full)
* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y)
* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246)
* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7)
* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9)
* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract)
* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract)
63 changes: 63 additions & 0 deletions docs/v3.2.0/Comparing-two-SV-programs.md
@@ -0,0 +1,63 @@
A frequent application of comparing SVs is to perform a 'bakeoff' of performance
between two SV programs against a single set of base calls.

Beyond looking at the Truvari results/report, you may like to investigate what calls
are different between the programs.

Below is a set of scripts that may help you generate those results. For our examples,
we'll be comparing arbitrary programs Asvs and Bsvs against base calls Gsvs.

*_Note_* - This assumes that each record in Gsvs has a unique ID in the vcf.
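
If your base calls lack IDs, one way to add them (this assumes `bcftools` is available; the ID format below is only an example and may need tweaking to guarantee uniqueness) is:

```bash
# Build an ID from position and SV type for every record, then re-index
bcftools annotate --set-id '%CHROM\_%POS\_%INFO/SVTYPE' Gsvs.vcf.gz -O z -o Gsvs.ids.vcf.gz
tabix -p vcf Gsvs.ids.vcf.gz
```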

Generate the Truvari report for Asvs and Bsvs
=============================================

```bash
truvari bench -b Gsvs.vcf.gz -c Asvs.vcf.gz -o cmp_A/ ...
truvari bench -b Gsvs.vcf.gz -c Bsvs.vcf.gz -o cmp_B/ ...
```

Combine the TPs within each report
==================================

```bash
cd cmp_A/
paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt
cd ../cmp_B/
paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt
```

Grab the FNs missed by only one program
=======================================

```bash
(grep -v "#" cmp_A/fn.vcf && grep -v "#" cmp_B/fn.vcf) | cut -f3 | sort | uniq -c | grep "^ *1 " | cut -f2- -d1 > missed_names.txt
```

Pull the TP sets' difference
============================

```bash
cat missed_names.txt | xargs -I {} grep -w {} cmp_A/combined_tps.txt > missed_by_B.txt
cat missed_names.txt | xargs -I {} grep -w {} cmp_B/combined_tps.txt > missed_by_A.txt
```

To look at the base-calls that Bsvs found, but Asvs didn't, run `cut -f1-12 missed_by_A.txt`.

To look at the Asvs that Bsvs didn't find, run `cut -f13- missed_by_B.txt`.

Calculate the overlap
=====================

One may wish for summary numbers of how many calls are shared/unique between the two programs.
Truvari has a program to help. See [[Consistency-report|consistency]] for details.
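
As a minimal sketch of one way to run it on the results above (`consistency` matches records by their exact variant key, so comparing each report's recovered base calls works well):

```bash
# How many Gsvs records were recovered by A only, B only, or both
truvari consistency cmp_A/tp-base.vcf cmp_B/tp-base.vcf > tp_consistency.txt
```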

Shared FPs between the programs
===============================

All of the work above has been about how to analyze the TruePositives. If you'd like to see which calls are shared between Asvs and Bsvs that aren't in Gsvs, simply run Truvari again.

```bash
bgzip cmp_A/fp.vcf && tabix -p vcf cmp_A/fp.vcf.gz
bgzip cmp_B/fp.vcf && tabix -p vcf cmp_B/fp.vcf.gz
truvari bench -b cmp_A/fp.vcf.gz -c cmp_B/fp.vcf.gz -o shared_fps ...
```
94 changes: 94 additions & 0 deletions docs/v3.2.0/Development.md
@@ -0,0 +1,94 @@
# Truvari API
Many of the helper methods/objects are documented such that developers can reuse truvari in their own code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/).

Documentation can also be seen using
```python
import truvari
help(truvari)
```
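
As a small illustration of reusing the API (a sketch — it assumes `pysam` is installed and an indexed `example.vcf.gz` is in the working directory), helpers such as `entry_size` operate directly on `pysam.VariantRecord` objects:

```python
import pysam
import truvari

# Report the SV length truvari computes for each record
vcf = pysam.VariantFile("example.vcf.gz")
for entry in vcf:
    print(entry.chrom, entry.start, truvari.entry_size(entry))
```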

# docker

A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run
```bash
docker build -t truvari .
```

You can then run Truvari through docker using
```bash
docker run -v `pwd`:/data -it truvari
```
Where `pwd` can be replaced by whatever directory you'd like to mount inside the container at the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point.
```bash
docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz
```

If you'd like to interact within the docker container for things like running the CI/CD scripts
```bash
docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari
```
You'll now be inside the container and can run FuncTests or run Truvari directly
```bash
bash repo_utils/truvari_ssshtests.sh
truvari anno svinfo -i example.vcf.gz
```

# CI/CD

A handful of scripts help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts.

```bash
pip install pylint anybadge coverage
```

Check code formatting with
```bash
python repo_utils/pylint_maker.py
```
We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting.

Test the code and generate a coverage report with
```bash
bash repo_utils/truvari_ssshtests.sh
```

Truvari leverages GitHub Actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, the workflow uploads an artifact of the `coverage html` report, which you can download to see a line-by-line accounting of test coverage.
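
To browse that same report locally, something like the following should work (this assumes the coverage data produced by the FuncTests run is in your working directory):

```bash
coverage html                                # writes an htmlcov/ directory by default
python3 -m http.server --directory htmlcov   # then open http://localhost:8000
```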

# git flow

To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/<name>`.

When contributing to the code, be sure you're working off of develop and have run `git flow init`.
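
A typical feature cycle looks roughly like this (the branch name is just an example):

```bash
git flow init -d                     # accept the defaults; develop is the base branch
git flow feature start my-feature    # creates feature/my-feature off of develop
# ...edit, commit, run the tests...
git flow feature finish my-feature   # merges the feature back into develop
```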

# versioning

Truvari uses [Semantic Versioning](https://semver.org/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPI, no suffix is placed on the version.
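
For instance, the installed version (suffix included) can be checked with (the output shown is only an example):

```python
import truvari
print(truvari.__version__)   # e.g. "3.2.0" on a release, or a "-dev"-suffixed version on develop
```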

# docs

The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`.

# Creating a release
Follow these steps to create a release

0) Bump release version
1) Run tests locally
2) Update API Docs
3) Freeze the Wiki
4) Ensure all code is checked in
5) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/)
6) Use github action to make a testpypi release
7) Check test release
```bash
python3 -m venv test_truvari
source test_truvari/bin/activate    # make sure pip installs into the fresh venv
python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari
```
8) Use GitHub action to make a pypi release
9) Change Updates Wiki
10) Download release-tarball.zip from step #8’s action
11) Create release (include #9) from the tag
12) Check out develop, bump to the dev version, and update the README ‘commits since’ badge

# conda

The package management we maintain is via PyPi so that the command `pip install truvari` is available. There is a very old version of Truvari that was uploaded to conda that we'll eventually look into maintaining. In the meantime, we recommend reading [this](https://www.anaconda.com/blog/using-pip-in-a-conda-environment) when using conda and trying to use Truvari's documented installation procedure.
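
If you do work inside conda, a reasonable pattern (the environment name and Python version are only examples) is to create an environment with pip available and then use Truvari's documented pip installation inside it:

```bash
conda create -n truvari_env python=3.9 pip
conda activate truvari_env
python3 -m pip install truvari
```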
55 changes: 55 additions & 0 deletions docs/v3.2.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md
@@ -0,0 +1,55 @@
By default, Truvari uses [edlib](https://github.com/Martinsos/edlib) to calculate the edit distance between two SV calls. Optionally, the [Levenshtein edit distance ratio](https://en.wikipedia.org/wiki/Levenshtein_distance) can be used to compute the `--pctsim` between two variants. These measures are different than the sequence similarity calculated by [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm).

To show this difference, consider the following two sequences:

```
     AGATACAGGAGTACGAACAGTACAGTACGA
     |||||||||||||||*||||||||||||||
ATCACAGATACAGGAGTACGTACAGTACAGTACGA
30bp Aligned
1bp Mismatched (96% similarity)
5bp Left-Trimmed (~14% of the bottom sequence)
```

The code below runs swalign, Levenshtein, and edlib to compute the `--pctsim` between the two sequences.


```python
import swalign
import Levenshtein
import edlib

seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA"
seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA"

scoring = swalign.NucleotideScoringMatrix(2, -1)
alner = swalign.LocalAlignment(scoring, gap_penalty=-2, gap_extension_decay=0.5)
aln = alner.align(seq1, seq2)
mat_tot = aln.matches
mis_tot = aln.mismatches
denom = float(mis_tot + mat_tot)
if denom == 0:
    ident = 0
else:
    ident = mat_tot / denom
scr = edlib.align(seq1, seq2)
totlen = len(seq1) + len(seq2)

print('swalign', ident)
# swalign 0.966666666667
print('levedit', Levenshtein.ratio(seq1, seq2))
# levedit 0.892307692308
print('edlib', (totlen - scr["editDistance"]) / totlen)
# edlib 0.9076923076923077
```

Because the swalign procedure only considers the number of matches and mismatches, the `--pctsim` is higher than the edlib and Levenshtein ratio.

If we were to account for the 5 'trimmed' bases from the Smith-Waterman alignment when calculating the `--pctsim` by counting each trimmed base as a mismatch, we would see the similarity drop to ~83%.

[This post](https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed) has a nice response describing exactly how the Levenshtein ratio is computed.

The Smith-Waterman alignment is much more expensive to compute compared to the Levenshtein ratio, and does not account for 'trimmed' sequence difference.

However, edlib is the fastest comparison method and is used by default. Levenshtein can be specified with `--use-lev` in `bench` and `collapse`.
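
For example, a bench run that scores `--pctsim` with the Levenshtein ratio instead of edlib might look like this (file names are placeholders):

```bash
truvari bench -b base.vcf.gz -c comp.vcf.gz -o bench_lev/ --use-lev
```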
