[experimental] option to estimate alignment score parameters from input pangenome data #1511
Merged
glennhickey merged 15 commits into master on Nov 11, 2024
All multiple alignments done by cactus (in BAR) use the scoring matrix from the config. This is currently set to
These are derived from the lastz defaults, which lastz describes as the "HOXD70" matrix from

Chiaromonte F, Yap VB, Miller W (2002). Scoring pairwise genomic sequence alignments. Pacific Symposium on Biocomputing 7:115-126.

But this matrix is very forgiving of substitutions, as you might expect from one designed to align very diverged genomes. For example, you can have a net-positive score by alternating between matches and transversions.
The upshot is that we can see cactus aligning right through all sorts of crazy regions, producing extremely messy, wrong-looking graphs to the detriment of downstream alignment.
So this PR adds the `--lastTrain` option to `cactus-pangenome` (and `cactus-minigraph`). This option will use `last-train` from the last software package to estimate scores from the data (using the reference genome and the input genome most diverged from it). These scores will be saved in last's `.train` format, and then imported into the cactus config before alignment. (`cactus-align` gets a `--scoresFile` option to take in the file too.)

Since `last` only trains one gap model and `abPOA` wants two (and seems to crash if you don't give it one), a scaling factor (defaulting to 3 in the config) is used to create a long gap model that's more expensive to open but cheaper to extend.

The models learned using some test data are radically different from the HOXD70 matrix, especially in that they penalize mismatches much more, which is evidence that HOXD70 isn't appropriate for most pangenome data.
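
To make the long gap model idea concrete, here is a rough sketch of how a second model could be derived from the single trained one. This is not the code in this PR: the `GapModel` class, the function names, and the exact rule (multiply the open cost by the scaling factor, divide the extend cost by it, floor at 1) are all assumptions for illustration, and the numbers in the usage note are made up rather than taken from a real `.train` file. The only parts taken from the description above are the default factor of 3 and the "more expensive to open but cheaper to extend" behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapModel:
    """Affine gap costs: a gap of length n costs open + n * extend."""
    open: int
    extend: int

    def cost(self, length: int) -> int:
        return self.open + length * self.extend


def derive_long_gap_model(short: GapModel, scale: int = 3) -> GapModel:
    """Build a hypothetical long gap model from the trained (short) one.

    ASSUMPTION: the open cost is multiplied by `scale` and the extend cost
    is integer-divided by `scale` (never below 1). The actual rule used by
    cactus may differ.
    """
    if scale < 1:
        raise ValueError("scale must be >= 1")
    return GapModel(open=short.open * scale,
                    extend=max(1, short.extend // scale))


def convex_gap_cost(short: GapModel, long: GapModel, length: int) -> int:
    """Two-piece (convex) gap cost: the cheaper of the two affine models."""
    return min(short.cost(length), long.cost(length))
```

With made-up trained costs of open=12 and extend=3, the derived long model is open=36 and extend=1. Short gaps are then scored by the trained model and gaps longer than 12 bases by the long one, which is the intended effect of a model that costs more to open but less to extend.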
I'd like to eventually incorporate something like this in progressive cactus but for now it's only for pangenomes. And it needs some more testing before it can be merged...