This release bundles a large number of bug fixes (several of which fix silent
correctness issues in training and clustering), removes the deprecated
SemiBin1 command, and includes many documentation improvements.
User-visible changes
- Remove `SemiBin1`. Only `SemiBin2` is shipped now; if you have old scripts that still call `SemiBin1`, they will need to be updated.
- Accept `hybrid` as a synonym for `long_read` in `--sequencing-type` (#216)
- Remove `--environment` parameter when learning a new model (#218)
- `check_install`: now also looks for `samtools` and reports it (in verbose mode) when missing. This is a soft check: missing `samtools` does not cause `check_install` to fail, since `samtools` is only required for CRAM input.
- `--allow-missing-mmseqs2` is now marked as deprecated in the help text (the option is already a no-op since `mmseqs2` is no longer required by `check_install`).
- Data-generation log message now distinguishes between generating training data and generating features for inference with a pretrained model (#219)
- Add Python 3.14 to the supported/tested Python versions (SemiBin now runs on Python 3.8–3.14)
- Add the missing `COPYING.MIT` license file (#214)
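
As a rough illustration of the `hybrid` synonym above, an option value can be accepted as an alias and normalized right after parsing. This is a minimal sketch, not SemiBin's actual argument handling: the `ALIASES` table, the `parse_sequencing_type` helper, the `short_read` choice, and the default value are assumptions made for the example; only the `hybrid` to `long_read` mapping comes from these notes.

```python
import argparse

# Hypothetical alias table: only hybrid -> long_read is taken from the notes
ALIASES = {"hybrid": "long_read"}


def parse_sequencing_type(argv):
    """Parse --sequencing-type and map aliases to their canonical value."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--sequencing-type",
        default="short_read",  # assumed default, for illustration only
        choices=["short_read", "long_read", "hybrid"],
    )
    args = parser.parse_args(argv)
    # Normalize once so downstream code only sees canonical names
    return ALIASES.get(args.sequencing_type, args.sequencing_type)


from_alias = parse_sequencing_type(["--sequencing-type", "hybrid"])
from_canonical = parse_sequencing_type(["--sequencing-type", "long_read"])
```

Normalizing at the parsing boundary keeps the rest of the code free of alias checks.
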
Bug fixes
- `single_easy_bin`: fix `AttributeError: 'Namespace' object has no attribute 'training_type'` when running with abundance files (`-a`) instead of BAM files. `set_training_type()` was only invoked on the BAM path, so the abundance-based single-sample workflow crashed before doing any work.
- Semi-supervised training: fix wrong input in the unlabeled contrastive-learning branch. The second argument to `model.forward()` was `unlabeled_train_input1` instead of `unlabeled_train_input2`, meaning both branches received identical data and the unlabeled-pair contrastive signal was being silently discarded.
- Long-read clustering: clip zero depth values before `np.log`. Zero depth (no reads mapped) was producing `-inf` values, silently corrupting the embedding used for DBSCAN clustering.
- Multi-sample binning: fix `split_data` matching sample names that are prefixes of others (e.g., `S1` matching `S10:contig`). Now uses `str.startswith()` rather than `str.contains()`.
- Marker calling: validate the cached `markers.hmmout` against an input fingerprint (FASTA size/mtime, `binned_length`, ORF finder) before reusing it. Previously, reruns into an existing output directory could silently reuse stale marker calls from a different input or different parameters. On mismatch a warning is logged and markers are recomputed.
- Long-read binning: remove stale output bins from previous runs before writing new results (previously, rerunning into the same output directory with fewer bins left old FASTA files behind).
- Empty FASTA inputs now fail cleanly with `Input file ... is empty. Please check inputs.` instead of crashing with `ZeroDivisionError` (single-sample paths) or `ValueError: attempt to get argmax of an empty sequence` (multi-sample paths).
- ORF finding: fix prodigal errors being silently ignored.
- `model_load`: raise a clear `ValueError` for unknown model names instead of an `UnboundLocalError`.
- `check_install`: fix incorrect logic when checking for `prodigal`.
- Several minor error-message and validation improvements in `utils.py` and `main.py`.
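
To make the zero-depth fix above concrete, the sketch below shows why `np.log` on a zero depth yields `-inf` and how clipping from below avoids it. It is an illustration under stated assumptions, not the code used in SemiBin: the helper name `safe_log_depth` and the floor value `min_depth=1e-6` are hypothetical.

```python
import numpy as np


def safe_log_depth(depth, min_depth=1e-6):
    """Return log(depth) with values below min_depth clipped (hypothetical helper)."""
    depth = np.asarray(depth, dtype=float)
    # Clip from below so contigs with no mapped reads stay finite
    clipped = np.clip(depth, min_depth, None)
    return np.log(clipped)


depth = np.array([0.0, 1.0, 12.5])

# Unclipped: the zero-depth entry becomes -inf (warning silenced for the demo)
with np.errstate(divide="ignore"):
    naive = np.log(depth)

# Clipped: every entry is finite and usable as a clustering feature
safe = safe_log_depth(depth)
```

A single `-inf` in a feature vector makes distances to that point infinite or undefined, which is why this kind of error can distort density-based clustering without raising an exception.
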
Documentation fixes
- Fix outdated output descriptions in `docs/output.md` (default model type, missing `bins_info.tsv`/`contig_bins.tsv`, clarify `output_bins` is the final output)
- Fix wrong `split_contigs` output filename in README strobealign-aemb example (`split.fa` → `split_contigs.fna.gz`)
- Fix `docs/aemb.md` description of `split_contigs.fna.gz` contents (only split halves, not originals) and an undefined variable in the helper script
- Remove references to `--mode` and `--training-type` flags no longer accepted by SemiBin2 in `docs/subcommands.md`
- Update `docs/semibin2.md` and `docs/index.md` to reflect that only `SemiBin2` is installed since v2.2
- Clarify `-b`/`-a` are alternatives (not both required) in `docs/subcommands.md` for `single_easy_bin`, `generate_sequence_features_single`, and `generate_sequence_features_multi`
- Fix `docs/generate.md` path from `scripts/` to `script/`
- Clarify Prodigal is optional in `docs/install.md`; add `samtools` to the source-install dependency list
- Fix step numbering in `docs/usage.md` (steps went 1, 3, 4 → now 1, 2, 3)
- Make it more prominent in the FAQ and usage docs how to handle hybrid (long+short) data
- Fix Python version ranges in README and install docs
- Fix truncated help text for `--random-seed`
- Fix wrong DOI for SemiBin2 in `CITATION.md`
- Remove incorrect `--depth-metabat2` mention from `generate_sequence_features_single` docs
- Remove duplicate HMMER entry and fix malformed Bedtools URL in README
- Many other smaller documentation corrections (typos, grammar, broken links, incorrect option names)
Internal improvements
- Self-supervised training reads each input CSV once instead of every epoch.
- Fix unclosed file handle in `run_prodigal`.
- Remove unused `pyyaml` dependency.
- Avoid naked `except:` clauses (better error reporting and easier debugging).
- Write validation errors to the log file (not just to stderr).
- Various small refactors and additional type annotations.
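
Two of the items above (closing file handles and avoiding naked `except:` clauses) follow a common pattern when wrapping an external tool. The sketch below illustrates that pattern in general terms; `run_to_file` is a hypothetical helper written for this example and is not SemiBin's `run_prodigal`.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(cmd, out_path):
    """Run cmd and write its stdout to out_path (hypothetical helper)."""
    # The with-block closes the handle even when the command fails
    with open(out_path, "w") as out:
        try:
            subprocess.check_call(cmd, stdout=out)
        except (OSError, subprocess.CalledProcessError) as exc:
            # Catch only the expected failures and keep the original cause
            raise RuntimeError(f"Command failed: {cmd!r}") from exc
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    # Successful command: its output ends up in the file
    path = run_to_file(
        [sys.executable, "-c", "print('ok')"], os.path.join(tmp, "out.txt")
    )
    with open(path) as handle:
        content = handle.read().strip()

    # Failing command: a non-zero exit status is reported, not ignored
    try:
        run_to_file(
            [sys.executable, "-c", "raise SystemExit(3)"],
            os.path.join(tmp, "fail.txt"),
        )
        failed = False
    except RuntimeError:
        failed = True
```

Using `check_call` rather than ignoring the return code means a failing tool surfaces as an error, and naming the exception types keeps unrelated bugs (for example a `KeyboardInterrupt` or a typo raising `NameError`) from being swallowed.
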