Skip to content

Synteny matcher silently drops true positives when paralogous HMMs cross-hit the same peptide #96

@Robaina

Description

@Robaina

When the HMM set passed to pynteny search contains paralogous models (e.g. nifD/nifK, nifE/nifN, bchL/nifH, V-/Mo-/Fe-only nitrogenase families), the same peptide commonly scores above threshold under more than one HMM. Pynteny keeps every hit and emits the warning

WARNING: At least two different HMMs produced identical sequence hits

…but the rolling-window pattern matcher in SyntenyHMMfilter does not deduplicate them. The duplicate rows are interleaved (by (contig, gene_pos) sort) between the rows of true syntenic neighbours, breaking the position-diff check inside the window. The result is a silent false negative: a genuine syntenic operon is dropped from synteny_matched.tsv with no error and no exit code.

This is the most common gotcha when users curate paralog-aware HMM panels — which is exactly the use case pynteny is best suited for.

Reproducer

Public dataset: 20 TARA Oceans MAGs from PRJEB97970 (or any marine MAG panel) + the PGAP-derived TIGRFAM HMM set restricted to {TIGR01287.1 (nifH), TIGR01282.1 (nifD), TIGR01286.1 (nifK)}.

pynteny build  --data mags/ --outfile peptides.faa --prepend-filename
pynteny search --synteny_struc '>TIGR01287.1 0 >TIGR01282.1 0 >TIGR01286.1' \
               --data peptides.faa --hmm_dir hmm_profiles/ --outdir out/

The MAG TARA_ION_45_MAG_00019 carries a textbook syntenic nifH→nifD→nifK at positions 33→34→35 of a single contig, all positive strand, bitscores 452.7 / 856.5 / 861.6. It does not appear in out/synteny_matched.tsv. The warning fires.

Mechanism

After get_all_HMM_hits(), the combined hits dataframe for that contig contains (rows abbreviated):

row pos HMM (code) source
1 33 nifH (0) TIGR01287.1 hit
2 34 nifD (1) TIGR01282.1 hit
3 34 nifK (2) TIGR01286.1 cross-hit on the nifD peptide
4 35 nifD (1) TIGR01282.1 cross-hit on the nifK peptide
5 35 nifK (2) TIGR01286.1 hit
6 36 nifD (1) TIGR01282.1 cross-hit on the nifE peptide
7 36 nifK (2) TIGR01286.1 cross-hit on the nifE peptide
8 37 nifK (2) TIGR01286.1 cross-hit on the nifN peptide

The window-of-3 scan in filter.py:316-327 looks for code-tuple (0, 1, 2) AND gene-position diffs of exactly (1, 1).

  • Window (rows 1,2,3) → codes (0,1,2) ✓ but position diffs (33→34=1, 34→34=0)
  • Window (rows 1,2,4) doesn't exist; the scan is contiguous over the sorted dataframe.
  • Every subsequent window starts inside the duplicate region and has the wrong code tuple.

So (nifH@33, nifD@34, nifK@35) is never visited as a window even though the bitscore-best assignment at each position is exactly the canonical operon. The warning at filter.py:294-296 detects the duplicate condition but is informational only.

Proposed change

1. Opt-in per-peptide deduplication (--best-hmm-wins)

Add a single deduplication pass inside SyntenyHMMfilter.get_all_HMM_hits(), immediately after the pd.concat of per-HMM frames and before the rolling-window logic:

if self._best_hmm_wins:
    all_hit_labels = (
        all_hit_labels
        .sort_values("bitscore", ascending=False)
        .drop_duplicates(subset=["full_label"], keep="first")
        .sort_values(["contig", "gene_pos"])
        .reset_index(drop=True)
    )

This requires propagating bitscore from HMMER.parse_HMM_search_output() (already collected, just not currently surfaced into get_all_HMM_hits) and threading a --best-hmm-wins flag from CLI → synteny_searchfilter_FASTA_by_synteny_structureSyntenyHMMfilter.__init__.

Default for 1.x: off (preserves existing behaviour for users with curated panels that intentionally rely on multi-HMM membership via |-groups). Default for 2.0: flip to on with a sibling --keep-all-hmm-hits escape hatch, since this is what most users actually want.

2. Upgrade the warning

Replace the current line in filter.py:

logger.warning("At least two different HMMs produced identical sequence hits")

with something actionable:

n_dup = all_hit_labels.full_label.duplicated().sum()
affected_pairs = (
    all_hit_labels[all_hit_labels.full_label.duplicated(keep=False)]
    .groupby("full_label").hmm.apply(lambda s: tuple(sorted(set(s))))
    .value_counts().head(5).to_dict()
)
logger.warning(
    f"{n_dup} peptides hit by >=2 HMMs (paralog cross-hits). "
    f"Top HMM pairs: {affected_pairs}. "
    "Syntenic patterns adjacent to these peptides may be silently dropped. "
    "Re-run with --best-hmm-wins to deduplicate by bitscore."
)

3. Docs

Add a short note to the user guide and to pynteny search --help:

Pynteny's synteny matcher does not disambiguate paralogous HMMs by default. If your panel contains close paralogs (e.g. nitrogenase α/β, MoFe-co biosynthesis nifE/N, light-independent protochlorophyllide reductase bchLNB), prefer either (a) a disjoint orthogonal HMM set, (b) |-grouped HMMs in the synteny structure, or (c) --best-hmm-wins to assign each peptide to its highest-scoring HMM.

What I would not change

  • The rolling-window matcher itself is correct on deduplicated input — no logic change needed there.
  • The "all peptides duplicated" hard sys.exit(1) at filter.py:297-304 is the right failsafe for the degenerate case and should stay.
  • The default behaviour, until a 2.0 release. Silent dedup would change the output for users who curate paralog panels intentionally.

Acceptance

  • A user running pynteny search --best-hmm-wins ... on the reproducer above gets TARA_ION_45_MAG_00019 in synteny_matched.tsv.
  • The current default behaviour is unchanged.
  • The updated warning prints the count and at least one affected HMM pair.
  • The user guide carries the paralog note.
  • A regression test exercises the duplicate-row case on a minimal synthetic dataset.

Related

  • Warning origin: src/pynteny/filter.py:293-304
  • Window scan: src/pynteny/filter.py:316-327
  • Existing |-group merge: SyntenyHMMfilter._merge_hits_by_HMM_group — covers the case where the user declares the HMM group in the synteny structure, but not the case where cross-hits are incidental.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions