[experimental] option to estimate alignment score parameters from input pangenome data by glennhickey · Pull Request #1511 · ComparativeGenomicsToolkit/cactus

glennhickey · 2024-10-28T16:26:39Z

All multiple alignments done by cactus (in BAR) use the scoring matrix from the config, which is currently set to:

partialOrderAlignmentSubMatrix="91 -114 -61 -123 -100 -114 100 -125 -61 -100 -61 -125 100 -114 -100 -123 -61 -114 91 -100 -100 -100 -100 -100 100"
partialOrderAlignmentGapOpenPenalty1="400"
partialOrderAlignmentGapExtensionPenalty1="30"
partialOrderAlignmentGapOpenPenalty2="1200"

These are derived from the lastz defaults; lastz cites its matrix as the "HOXD70" matrix from Chiaromonte F, Yap VB, Miller W (2002). Scoring pairwise genomic sequence alignments. Pacific Symposium on Biocomputing 7:115-126.

But this matrix is very forgiving of substitutions, as you might expect from one designed to align very diverged genomes. For example, you can get a net-positive score by alternating between matches and transitions (91 or 100 per match against -61 per transition).
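The arithmetic can be checked directly. The sketch below assumes the 25 values form a row-major 5x5 matrix in the order A, C, G, T, N (an assumption; the config snippet does not spell out the ordering) and scores an ungapped alignment that alternates matches with a single kind of substitution. Under that ordering, the substitutions scored -61 (A/G and C/T) are the ones that leave the alternating pattern net-positive. The helper names are illustrative and are not part of cactus.

```python
# Assumption: row-major 5x5 matrix over A, C, G, T, N (ordering not stated in the config).
SUB_MATRIX = ("91 -114 -61 -123 -100 -114 100 -125 -61 -100 "
              "-61 -125 100 -114 -100 -123 -61 -114 91 -100 "
              "-100 -100 -100 -100 100")
ALPHABET = "ACGTN"


def parse_matrix(text, alphabet=ALPHABET):
    """Turn the flat config string into a {(ref, query): score} lookup."""
    values = [int(v) for v in text.split()]
    n = len(alphabet)
    if len(values) != n * n:
        raise ValueError("expected %d values, got %d" % (n * n, len(values)))
    return {(a, b): values[i * n + j]
            for i, a in enumerate(alphabet)
            for j, b in enumerate(alphabet)}


def ungapped_score(scores, ref, query):
    """Sum the substitution scores column by column (no gaps)."""
    return sum(scores[(r, q)] for r, q in zip(ref, query))


scores = parse_matrix(SUB_MATRIX)

ref = "A" * 10
ag_query = "AG" * 5  # match, then A->G (a transition, scored -61), repeated
ac_query = "AC" * 5  # match, then A->C (a transversion, scored -114), repeated

ag_score = ungapped_score(scores, ref, ag_query)  # 5*91 + 5*(-61)
ac_score = ungapped_score(scores, ref, ac_query)  # 5*91 + 5*(-114)
print(ag_score, ac_score)
```

With these values, an alignment in which every second base is an A/G mismatch still accumulates a positive score, so nothing in the substitution scores alone tells the aligner to stop.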

The upshot is that we can see cactus aligning right through all sorts of crazy regions, producing extremely messy, wrong-looking graphs to the detriment of downstream alignment.

So this PR adds the --lastTrain option to cactus-pangenome (and cactus-minigraph). This option uses last-train from the last software package to estimate scores from the data (using the reference genome and the input genome most diverged from it). These scores are saved in last's .train format and then imported into the cactus config before alignment (cactus-align gets a --scoresFile option to take in the file too).
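To make the import step concrete, here is a minimal sketch of reading a score matrix and gap costs out of .train-style text. It is not the importer added by this PR. It assumes the file carries parameters on `#last -a N` / `#last -b N` lines (gap existence and extension costs) followed by a labelled matrix table, and the embedded example text uses made-up numbers rather than results from any real run.

```python
import re

# Illustrative text only: laid out the way the end of a .train file is
# assumed to look here. The numbers are invented, not output from this PR.
EXAMPLE_TRAIN = """\
#last -a 16
#last -A 17
#last -b 3
#last -B 3
# score matrix (query letters = columns, reference letters = rows):
       A      C      G      T
A      5    -14     -7    -15
C    -14      7    -15     -8
G     -7    -15      7    -14
T    -15     -8    -14      5
"""


def parse_train(text):
    """Return (params, matrix) from .train-like text (assumed layout)."""
    params, matrix, columns = {}, {}, None
    for line in text.splitlines():
        option = re.match(r"#last\s+-(\w)\s*(\S+)", line)
        if option:
            params[option.group(1)] = option.group(2)
        elif line.startswith("#") or not line.strip():
            continue  # other comments and blank lines
        elif columns is None:
            columns = line.split()  # header row: query letters
        else:
            row, *vals = line.split()  # reference letter, then its scores
            for col, val in zip(columns, vals):
                matrix[(row, col)] = int(val)
    return params, matrix


params, matrix = parse_train(EXAMPLE_TRAIN)
gap_open, gap_extend = int(params["a"]), int(params["b"])

# Flatten to a space-separated, row-major string like the config attribute
# (4x4 only; how the N row and column are filled in is not covered here).
flat = " ".join(str(matrix[(r, c)]) for r in "ACGT" for c in "ACGT")
print(gap_open, gap_extend)
print(flat)
```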

Since last only trains one gap model and abPOA wants two (and seems to crash if you don't give it one), a scaling factor (defaulting to 3 in the config) is used to create a long gap model that's more expensive to open but cheaper to extend.
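The description does not say exactly how the scaling factor is applied, so the following is only a sketch under one plausible assumption: the long-gap model multiplies the open penalty by the factor and divides the extension penalty by it (never going below 1). The function names are hypothetical and the rule in cactus may differ.

```python
def two_piece_gap_model(gap_open, gap_extend, scale=3):
    """Derive a (short, long) pair of affine gap models from one trained model.

    Assumed rule (not confirmed by the PR text): the long model costs `scale`
    times more to open and `scale` times less to extend, floored at 1.
    """
    long_open = gap_open * scale
    long_extend = max(1, gap_extend // scale)
    return (gap_open, gap_extend), (long_open, long_extend)


def gap_cost(length, model):
    """Affine cost of a gap of the given length under one model."""
    open_penalty, extend_penalty = model
    return open_penalty + length * extend_penalty


def crossover_length(short, long_):
    """Smallest gap length at which the long model becomes strictly cheaper."""
    if long_[1] >= short[1]:
        raise ValueError("long model must extend more cheaply than the short one")
    length = 1
    while gap_cost(length, long_) >= gap_cost(length, short):
        length += 1
    return length


short, long_ = two_piece_gap_model(400, 30, scale=3)
crossover = crossover_length(short, long_)
print(short, long_, crossover)
```

Under this assumed rule, an aligner that takes the cheaper of the two costs would price short gaps with the first model and switch to the second only once a gap is longer than the crossover length.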

The models learned from some test data are radically different from the HOXD70 matrix, especially in that they penalize mismatches much more, which is evidence that HOXD70 isn't appropriate for most pangenome data.

I'd like to eventually incorporate something like this in progressive cactus but for now it's only for pangenomes. And it needs some more testing before it can be merged...


glennhickey force-pushed the gpu-lastz branch from 8806f8c to 14b1d71 Compare October 29, 2024 16:11

glennhickey mentioned this pull request Oct 29, 2024

Backtrack errors when trying different scoring matrices yangao07/abPOA#79

Open

glennhickey force-pushed the gpu-lastz branch from 14b1d71 to a1a31ec Compare October 30, 2024 16:38

glennhickey added 2 commits October 31, 2024 09:17

add option to use last to learn scoring params from data (for pangenomes)

7455227

update abpoa for overflow patch

1a0cb59

glennhickey force-pushed the gpu-lastz branch from eb92f0f to 1a0cb59 Compare October 31, 2024 13:17

bump up abpoa banding

d2f270c

glennhickey force-pushed the gpu-lastz branch from 0782133 to d2f270c Compare November 1, 2024 17:51

glennhickey added 12 commits November 1, 2024 16:54

Merge remote-tracking branch 'origin/master' into gpu-lastz

6e4145c

add option to use last to learn scoring params from data (for pangenomes)

805f135

update abpoa for overflow patch

8c2e024

bump up abpoa banding

29a21ea

Merge branch 'gpu-lastz' of github.com:ComparativeGenomicsToolkit/cactus into gpu-lastz

c0141ba

add --scoresFile to cactus-pangenome; doc blurb

4e8d994

use lowest instead of avg scores for N

dd79741

flush output buffer

1d686e6

update vg to 1.61

1aaf2d7

oops

a7855aa

Merge remote-tracking branch 'origin/master' into gpu-lastz

0a15389

fix version

2b698aa

glennhickey merged commit 58e04f9 into master Nov 11, 2024

glennhickey merged 15 commits into master from gpu-lastz