04 VCF Normalization

Tim Dunn edited this page Feb 9, 2024 · 1 revision

After variant clustering, variants can optionally be realigned by passing the --realign-query and/or --realign-truth options. Adding the --realign-only flag stops after realignment and skips the downstream evaluations.

Best-Alignment Normalization

As initially introduced in this manuscript and further explored in our work, best-alignment normalization can be used to select between several possible variant representations when complex variants are involved. Affine-gap Smith-Waterman alignment is used to select the "best" variant representation, where "best" is defined by a given set of alignment parameters: match (m), mismatch (x), gap-open (o), and gap-extend (e) penalties. The design space for these parameters is shown below, with many common alignment tools plotted, along with four example alignments (A, B, C, D) and their resulting variant representations. By default, vcfdist selects the representation at Point C.

[Figure: the affine-gap design space for variant representation]
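
To make the role of the parameters concrete, here is a minimal sketch that scores candidate variant representations under affine-gap penalties and picks the cheapest one. It is not vcfdist's implementation: the function names, the penalty values, and the convention that a gap of length L costs o + e*L are all assumptions made for illustration, and alleles are written without the VCF anchor base for brevity.

```python
from typing import Dict, List, Tuple

# A variant is (ref_allele, alt_allele) with shared bases already trimmed.
# Hypothetical simplification: "" stands for the empty side of an indel.
Variant = Tuple[str, str]


def variant_cost(ref: str, alt: str, x: int, o: int, e: int) -> int:
    """Penalty for one SNP or one pure insertion/deletion."""
    if len(ref) == 1 and len(alt) == 1:
        return x  # single-base substitution
    if (ref == "") != (alt == ""):
        # Assumed convention: a gap of length L costs o + e * L.
        return o + e * max(len(ref), len(alt))
    raise ValueError("decompose complex variants into SNPs and indels first")


def representation_cost(variants: List[Variant], region_len: int,
                        m: int, x: int, o: int, e: int) -> int:
    """Total penalty of one representation over a reference region."""
    consumed = sum(len(ref) for ref, _ in variants)  # reference bases in variants
    matches = region_len - consumed                  # remaining bases are matches
    return m * matches + sum(variant_cost(r, a, x, o, e) for r, a in variants)


def best_representation(candidates: Dict[str, List[Variant]], region_len: int,
                        m: int, x: int, o: int, e: int) -> str:
    """Name of the candidate representation with the lowest total penalty."""
    return min(candidates,
               key=lambda name: representation_cost(candidates[name],
                                                    region_len, m, x, o, e))


# Reference region "CA", query haplotype "AC": two ways to describe the change.
candidates = {
    "snps": [("C", "A"), ("A", "C")],   # two substitutions
    "indels": [("", "A"), ("A", "")],   # insert A before C, delete trailing A
}

cheap_gaps = best_representation(candidates, region_len=2, m=0, x=5, o=1, e=1)
cheap_mismatches = best_representation(candidates, region_len=2, m=0, x=1, o=5, e=1)
print(cheap_gaps, cheap_mismatches)
```

With cheap gaps the insertion-plus-deletion representation is selected, and with cheap mismatches the two-SNP representation is selected. This is the sense in which different points in the design space lead to different variant representations.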

Standard VCF Normalization

The traditional method of variant normalization involves decomposing complex variants, trimming unnecessary bases from the variant representation, and then left-aligning INDELs.

[Figure: variant decomposition, trimming, and left-shifting]
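
The trimming and left-alignment steps can be sketched as follows. This is an illustrative version of the commonly used trim-and-shift loop, not code taken from vcfdist; the function name `normalize`, the 0-based positions, and the in-memory reference string are assumptions. Decomposition of complex variants is not covered, and the sketch simply raises an error when a variant would have to shift past the start of the reference.

```python
from typing import Tuple


def normalize(pos: int, ref: str, alt: str, reference: str) -> Tuple[int, str, str]:
    """Trim and left-align one biallelic variant (0-based pos, hypothetical API)."""
    if reference[pos:pos + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    if ref == alt:
        raise ValueError("REF and ALT are identical; nothing to normalize")

    changed = True
    while changed:
        changed = False
        # Right-trim: drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        # Repeating trim + extend is what shifts an indel leftward.
        if not ref or not alt:
            if pos == 0:
                raise ValueError("cannot extend past the start of the reference")
            pos -= 1
            ref = reference[pos] + ref
            alt = reference[pos] + alt
            changed = True

    # Left-trim: drop shared leading bases, keeping at least one base per allele.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


reference = "GCACACT"
# A 2-base deletion inside the CA repeat, written at its rightmost position.
shifted = normalize(3, "CAC", "C", reference)
# A SNP padded with unnecessary flanking bases.
trimmed = normalize(1, "CACA", "CATA", reference)
print(shifted, trimmed)
```

In the first call the deletion slides left through the repeat until it can no longer be shifted; in the second call only the padding is removed and the SNP stays where it is.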

Standard normalization is sufficient to create a unique canonical representation for a single variant, but not when multiple or complex variants are involved, since the same haplotype can then be described by several different sets of variants.
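
A small example illustrates why per-variant normalization does not give a unique representation for multiple variants. The helper `apply_variants` and the toy sequences below are hypothetical; the sketch applies two different variant sets to the same reference and shows that they yield the same haplotype, even though no individual variant can be trimmed or left-shifted any further.

```python
from typing import List, Tuple

# (0-based position, REF allele, ALT allele), in VCF-style anchored form
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Build the haplotype produced by non-overlapping variants."""
    pieces = []
    cursor = 0  # next reference base not yet copied
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("variants overlap")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError("REF allele does not match the reference sequence")
        pieces.append(reference[cursor:pos])  # unchanged bases before the variant
        pieces.append(alt)                    # substituted sequence
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])         # unchanged tail
    return "".join(pieces)


reference = "GCAT"

# Representation 1: two adjacent SNPs.
two_snps = [(1, "C", "A"), (2, "A", "C")]
# Representation 2: an insertion of A after G, then a deletion of the A after C.
ins_del = [(0, "G", "GA"), (1, "CA", "C")]

hap_snps = apply_variants(reference, two_snps)
hap_indels = apply_variants(reference, ins_del)
print(hap_snps, hap_indels, hap_snps == hap_indels)
```

Both variant sets describe the haplotype GACT. Choosing between them requires a criterion that looks at the variants jointly, which is what the alignment-based approach described under Best-Alignment Normalization provides.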