CHISEL splits protein sequence databases into train / test / validation sets, filters sequence pairs for homology, and builds publication-ready benchmark splits. The suite includes:
| Tool | Role |
|---|---|
| `chisel_build` | Phase 1: sample test/val from input DB, de-overlap splits |
| `chisel_filter` | Multi-tool filter (pHMMER, MMseqs2, BLAST+, Smith–Waterman) |
| `chisel_dedup` | Self-deduplicate a FASTA file |
| `chisel_splitter` | Low-level Profmark-style splitter (used by `chisel_build`) |
| `phmmer_filter` | Low-level pHMMER-based pairwise filter |
CHISEL is released under the MIT License (see License).
To cite CHISEL, please use:
[CHISEL_CITATION_PLACEHOLDER]
[CHISEL_PAPER_LINK_PLACEHOLDER]
We thank the developers of HMMER3 [1] for the Easel tools and the base pHMMER pipeline that were customized for chisel_splitter and chisel_filter.
| OS | Build support | Pipeline scripts | External tools | Notes |
|---|---|---|---|---|
| Linux (x86_64) | Official | Official | Official | Primary supported platform. |
| macOS (Apple Silicon / Intel) | Official | Official | Official | Use Homebrew/system packages for dependencies. |
| Windows (WSL2) | Official (via Linux in WSL2) | Official (via Bash in WSL2) | Official (via Linux in WSL2) | Recommended Windows path. |
| Windows (native) | Planned | Planned | Planned | Not currently first-class. |
Core build tools (all supported platforms)
- C compiler (`gcc`/`clang`)
- GNU `make`
- `ar`, `ranlib`
- Bash (for pipeline scripts and install helpers)
- `pthread` and math library (`-lm`)
Runtime dependencies for chisel_filter
- MMseqs2
- NCBI BLAST+ (`makeblastdb`, `blastp`)
- FASTA package (`ssearch36`)
Install external tools via make install-external (see below).
    git clone https://github.com/Protein-Sequence-Annotation/chisel.git
    cd chisel
    ./configure   # detects CPU (SSE on x86_64, NEON on aarch64/arm64); writes build-config.mk
    make

./configure always runs under bash (#!/usr/bin/env bash); your login shell does not matter.
Permission denied on scripts? Run make install-scripts once after cloning. make install-external and make test-install do this automatically. You can also invoke scripts as bash configure.
Re-run ./configure after changing machines or to force a backend (./configure --with-impl=sse or --with-impl=neon). make distclean removes build-config.mk and build products.
Executables land in bin/:
| Binary | Role |
|---|---|
| `chisel_build` | Phase 1: sample test/val, de-overlap splits |
| `chisel_filter` | Multi-tool filter (p / m / b / s steps) |
| `chisel_dedup` | Self-deduplication |
| `chisel_splitter` | Standalone pHMMER splitter |
| `phmmer_filter` | Standalone pHMMER filter |
Add bin to your PATH or call tools with full paths:
    export PATH="/path/to/chisel/bin:$PATH"

To install the external tools, run:

    make install-external

This installs MMseqs2, BLAST+, and FASTA36 into chisel/external_tools by default.
Linux (install/linux/install_external_linux.sh) selects downloads from uname -m: x86_64 uses MMseqs2 sse2 binaries and NCBI x64-linux BLAST; aarch64/arm64 uses MMseqs2 arm64 and NCBI aarch64-linux BLAST. Override with MMSEQS_ARCH (e.g. avx2) or BLAST_PLATFORM if needed.
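The Linux selection rules described above can be summarized in a short sketch. This is not the code in install_external_linux.sh: the function name `select_linux_downloads` and its output format are hypothetical, and reading the overrides from the environment is an assumption.

```shell
# Hypothetical helper that mirrors the documented selection rules;
# it is NOT the actual logic of install_external_linux.sh.
select_linux_downloads() {
  local machine mmseqs_default blast_default
  machine="${1:-$(uname -m)}"
  case "$machine" in
    x86_64)        mmseqs_default="sse2";  blast_default="x64-linux" ;;
    aarch64|arm64) mmseqs_default="arm64"; blast_default="aarch64-linux" ;;
    *) echo "unsupported machine: $machine" >&2; return 1 ;;
  esac
  # Assumption: MMSEQS_ARCH / BLAST_PLATFORM, when set, win over the defaults.
  echo "mmseqs=${MMSEQS_ARCH:-$mmseqs_default} blast=${BLAST_PLATFORM:-$blast_default}"
}
```

Calling `select_linux_downloads` with no argument uses the current machine; passing a name such as `aarch64` shows what would be chosen on another architecture.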
macOS: install/macos/install_external_macos.sh prefers Homebrew for MMseqs2 and BLAST+; otherwise falls back to upstream tarballs. FASTA36 is built from source.
Windows (WSL2):
    powershell -ExecutionPolicy Bypass -File .\install\windows\install_external_windows.ps1 -ChiselDir C:\path\to\chisel

If external tools are already installed, point CHISEL at them in your config (see Configuration):
| Config key | Set this to |
|---|---|
| `MMSEQS` | Full path to `mmseqs` executable |
| `BLAST_DIR` | Directory containing `makeblastdb` and `blastp` |
| `FASTA_DIR` | Directory containing `ssearch36` |
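Before a long run, it can help to confirm that the paths in the table above point at real executables. The following is a minimal sketch, not part of CHISEL: `is_exe` and `check_filter_tools` are hypothetical names, and the function assumes `MMSEQS`, `BLAST_DIR`, and `FASTA_DIR` are already set in the current shell. It takes a chisel_filter order string (letters p, m, b, s) and checks only the tools that string requests.

```shell
# Hypothetical preflight check (not part of CHISEL).
# Reads MMSEQS, BLAST_DIR, and FASTA_DIR from the current shell.
is_exe() { [ -f "$1" ] && [ -x "$1" ]; }

check_filter_tools() {
  local order="$1" missing=0 exe
  # MMseqs2 is needed only when the order contains "m".
  case "$order" in *m*)
    is_exe "$MMSEQS" || { echo "missing: MMSEQS=$MMSEQS" >&2; missing=1; } ;;
  esac
  # BLAST+ needs both makeblastdb and blastp when the order contains "b".
  case "$order" in *b*)
    for exe in makeblastdb blastp; do
      is_exe "$BLAST_DIR/$exe" || { echo "missing: $BLAST_DIR/$exe" >&2; missing=1; }
    done ;;
  esac
  # Smith-Waterman uses ssearch36 when the order contains "s".
  case "$order" in *s*)
    is_exe "$FASTA_DIR/ssearch36" || { echo "missing: $FASTA_DIR/ssearch36" >&2; missing=1; } ;;
  esac
  return "$missing"
}
```

For example, `check_filter_tools mb` succeeds without FASTA36 installed, because the `s` step is not requested.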
Only the tools listed in --order need to be installed. You can also call the dispatcher directly:
    ./install/install_external.sh /path/to/chisel

To verify the installation, run:

    make test-install

On Linux/macOS this runs install/test_installation.sh; on Windows it routes through WSL2 via install/windows/test_installation_windows.ps1.
Defaults live in install/chisel.config.in. Generate runnable configs with absolute paths:
    make config   # writes install/chisel.config and install/test_filter.config

Re-run after cloning or moving the repo. make all and make install-external also refresh configs when templates change.
Example for running different scripts
| Script | Example |
|---|---|
| `chisel_build` | `chisel_build --config install/chisel.config --input-db seqDB.fasta --output-dir out/` |
| `chisel_dedup` | `chisel_dedup --config install/chisel.config --file test.fasta [--output-dir out/]` |
| `chisel_filter` | `chisel_filter --config install/chisel.config --order mbps --fixed-file test.fasta --db-file train.fasta` |
Phase-specific command-line options in the config file:
| Prefix | Used by |
|---|---|
| `SPLIT_*`, `SPLITTER_EXTRA` | `chisel_build` (splitter step); `SPLIT_SEED` → `--split-seed` |
| `BUILD_FILTER_*` | `chisel_build` filter steps (`CHISEL_PROFILE=build`) |
| `FILTER_*` | standalone `chisel_filter` |
| `DEDUP_*` | `chisel_dedup` |
Common variables: E_VALUE, Z_SIZE, PHMMER_CORES, MMSEQS_CORES, BLAST_CORES, SW_CORES, SW_SHARDS, REMOVE_TARGET (db or fixed), ORDER (for chisel_build).
Log output: chisel_build, chisel_dedup, and chisel_filter print progress summaries on stdout and tool verbose output on stderr. With SLURM, point #SBATCH --output at stdout and #SBATCH --error at stderr.
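Outside SLURM, the same separation of streams can be reproduced with plain redirection. The wrapper below is a minimal sketch: `run_logged` and the `.summary.log` / `.verbose.log` file names are hypothetical choices, not something CHISEL provides.

```shell
# Hypothetical wrapper; the log file names are illustrative only.
# Usage: run_logged <log-prefix> <command> [args...]
run_logged() {
  local prefix="$1"
  shift
  # stdout carries the progress summaries, stderr the verbose tool output.
  "$@" > "${prefix}.summary.log" 2> "${prefix}.verbose.log"
}
```

A call such as `run_logged build chisel_build --config install/chisel.config --input-db seqDB.fasta --output-dir results/` would then leave the summaries in `build.summary.log` and the tool output in `build.verbose.log`.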
Generate default config, then run the three-phase workflow:
Phase 1: Build test and validation sets
chisel_build scans the input database only until SPLIT_TEST_LIMIT and SPLIT_VAL_LIMIT are satisfied. Sequences along the way are assigned to train, test, val, or discard; three filter passes then remove cross-set homologs among those outputs. The train.fasta written here contains only train assignments from that prefix of the database—not sequences from the unprocessed remainder.
    make config
    chisel_build --config install/chisel.config --input-db seqDB.fasta --output-dir results/

Outputs: results/test.fasta, results/val.fasta, results/train.fasta (partial), results/discard.fasta.
Phase 2: Self-deduplicate test and validation sets
    chisel_dedup --config install/chisel.config --file results/val.fasta
    chisel_dedup --config install/chisel.config --file results/test.fasta

Phase 3: Grow the training set from remaining sequences
Use chisel_filter on candidate sequences from the rest of the database (everything not already assigned in Phase 1). With REMOVE_TARGET=db, homologs of the fixed test set are removed from the candidate pool.
    chisel_filter --config install/chisel.config --order pmbs \
        --fixed-file results/test.fasta --db-file train_candidates.fasta

A full benchmark split is Phase 1 + Phase 3 (and optionally Phase 2): chisel_build does not scan the entire database for training sequences.
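The Phase 3 command expects a candidate file (train_candidates.fasta here) holding the sequences that Phase 1 did not assign. The text does not say how to produce it, so the following is only one possible sketch using `awk`. The helper names `fasta_ids` and `fasta_exclude_ids` are hypothetical, and the sketch assumes that a sequence ID is the first whitespace-delimited token after `>` in each header and that IDs are unique.

```shell
# Hypothetical helpers (not part of CHISEL).
# Print the ID (first token after ">") of every record in the given FASTA files.
fasta_ids() {
  awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$@"
}

# Usage: fasta_exclude_ids <db.fasta> <assigned.fasta>...
# Prints the records of <db.fasta> whose IDs appear in none of the assigned files.
fasta_exclude_ids() {
  local db="$1" ids
  shift
  ids="$(mktemp)"
  fasta_ids "$@" > "$ids"
  awk -v idfile="$ids" '
    BEGIN { while ((getline line < idfile) > 0) assigned[line] = 1 }
    /^>/  { id = substr($1, 2); keep = !(id in assigned) }  # decide once per record
    keep                                                    # print header and sequence lines
  ' "$db"
  rm -f "$ids"
}
```

Under those assumptions, the candidate pool could be written with `fasta_exclude_ids seqDB.fasta results/test.fasta results/val.fasta results/train.fasta results/discard.fasta > train_candidates.fasta`.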
Standalone splitter (low-level pHMMER-based splitter):
    chisel_splitter --dbblock 100 --test_limit 20 --val_limit 10 -o stats --output_dir results seqDB.fasta

chisel_build scans the input database until the test and validation size limits (SPLIT_TEST_LIMIT, SPLIT_VAL_LIMIT) are reached, assigning sequences along the way to train, test, val, or discard. It then runs three chisel_filter passes to remove cross-set homologs among those outputs.
The train.fasta produced here covers only sequences assigned to train from the scanned prefix—it does not represent training sequences from the full database. Use chisel_filter (Phase 3) to add dissimilar training sequences from the remainder.
Final outputs in --output-dir: test.fasta, val.fasta, train.fasta, discard.fasta.
| Option | Description |
|---|---|
| `--config <file>` | Config file (required) |
| `--input-db <fasta>` | Input sequence database (required) |
| `--output-dir <dir>` | Output directory (required) |
Splitter and filter settings come from the config (SPLIT_*, ORDER, BUILD_FILTER_*, etc.).
chisel_filter runs pHMMER, MMseqs2, BLAST+, and/or Smith–Waterman in the order given by --order. Each tool performs two-pass pruning: it searches with the removal side as the query, prunes the hits, then runs the reverse direction against the pruned file before moving on to the next tool.
| Option | Description |
|---|---|
| `--order <string>` | Tool order: `p` = pHMMER, `m` = MMseqs2, `b` = BLAST+, `s` = Smith–Waterman (`ssearch36`). Example: `pmbs`, `mbps`. |
| `--config <file>` | Config file |
| `--fixed-file <fasta>` | Fixed/reference side (e.g. test set) |
| `--db-file <fasta>` | Database to filter against the fixed set |
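To make the `--order` letters concrete, the sketch below expands an order string into the tool sequence it stands for. `describe_order` is a hypothetical helper, not a CHISEL command, and it checks only that every letter is one of p, m, b, s; whether CHISEL itself accepts repeated letters is not stated here.

```shell
# Hypothetical helper (not part of CHISEL): expand an --order string
# into the tool names it refers to, one per line, in run order.
describe_order() {
  local order="$1" step
  case "$order" in
    ""|*[!pmbs]*)
      echo "invalid order '$order' (expected only p, m, b, s)" >&2
      return 1 ;;
  esac
  while [ -n "$order" ]; do
    step="${order%"${order#?}"}"   # first character of the remaining string
    order="${order#?}"             # drop that character
    case "$step" in
      p) echo "pHMMER" ;;
      m) echo "MMseqs2" ;;
      b) echo "BLAST+" ;;
      s) echo "Smith-Waterman" ;;
    esac
  done
}
```

With this sketch, `describe_order mbps` lists MMseqs2, BLAST+, pHMMER, and Smith-Waterman in that order.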
| Option | Description |
|---|---|
| `--out-suffix <name>` | Suffix for per-tool output dirs; defaults to `TASK_ID` from config |
| Variable | Role |
|---|---|
| `OUT_DIR` | Base directory for outputs (required) |
| `REMOVE_TARGET` | `db` or `fixed`: which side loses sequences after hits |
| `PHMMER_FILTER`, `MMSEQS`, `BLAST_DIR`, `FASTA_DIR` | Tool paths |
| `E_VALUE`, `Z_SIZE` | E-value threshold and database size calibration |
| `PHMMER_CORES`, `MMSEQS_CORES`, `BLAST_CORES`, `SW_CORES` | Thread counts |
| `SW_SHARDS` | Parallel `ssearch36` shards (`SW_CORES` must be divisible by `SW_SHARDS`) |
| `FILTER_PHMMER_PHIGH`, `FILTER_PHMMER_PLOW`, `FILTER_PHMMER_QSIZE`, `FILTER_PHMMER_EXTRA` | pHMMER tuning (standalone) |
| `FILTER_MMSEQS_*`, `FILTER_BLASTP_EXTRA`, `FILTER_SW_EXTRA` | Per-tool extras |
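The divisibility rule for `SW_CORES` and `SW_SHARDS` in the table above is easy to check before submitting a job. The function below is a minimal sketch with a hypothetical name. It assumes the cores are split evenly across shards, so it prints `SW_CORES / SW_SHARDS` as the per-shard thread count. That reading is an assumption, not something the table states.

```shell
# Hypothetical check (not part of CHISEL) for the SW_CORES / SW_SHARDS rule.
# Usage: check_sw_shards <SW_CORES> <SW_SHARDS>
check_sw_shards() {
  local cores="$1" shards="$2"
  if [ "$shards" -le 0 ] || [ $((cores % shards)) -ne 0 ]; then
    echo "SW_CORES=$cores is not divisible by SW_SHARDS=$shards" >&2
    return 1
  fi
  # Assumption: cores are shared evenly, giving cores/shards threads per shard.
  echo "$((cores / shards))"
}
```

For example, `check_sw_shards 8 4` passes, while `check_sw_shards 8 3` reports the mismatch and returns a non-zero status.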
See src/chisel_filter.sh for full defaults and behavior.
chisel_dedup removes within-file homologs using phmmer_filter with --no_self. It writes <stem>_dedup.fasta (e.g. test.fasta → test_dedup.fasta).
| Option | Description |
|---|---|
| `--config <file>` | Config file (required) |
| `--file <fasta>` | Input FASTA (required) |
| `--output-dir <dir>` | Output directory (default: same directory as input) |
Tuning via DEDUP_PHIGH, DEDUP_PLOW, DEDUP_QSIZE, DEDUP_EXTRA in config.
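When scripting around chisel_dedup, it is useful to know where the deduplicated file will land. The sketch below derives the path from the `<stem>_dedup.fasta` naming rule and the `--output-dir` default described above. `dedup_output_path` is a hypothetical helper, and treating the stem as the file name minus its last extension is an assumption.

```shell
# Hypothetical helper (not part of CHISEL).
# Usage: dedup_output_path <input.fasta> [output-dir]
dedup_output_path() {
  local input="$1" outdir="${2:-}" base stem
  base="$(basename "$input")"
  stem="${base%.*}"                                   # assumption: drop the last extension
  [ -n "$outdir" ] || outdir="$(dirname "$input")"    # default: same directory as input
  echo "${outdir%/}/${stem}_dedup.fasta"
}
```

Here `dedup_output_path results/test.fasta` prints `results/test_dedup.fasta`, and adding `out/` as a second argument prints `out/test_dedup.fasta`.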
chisel_splitter is the low-level splitter that divides one input FASTA into train / test / val / discard. It is used internally by chisel_build; call it directly for custom split workflows.
| Argument | Description |
|---|---|
| `<seqdb>` | Input protein sequence file (positional; must be last) |
| Option | Default | Description |
|---|---|---|
| `-o <prefix>` | - | Prefix for stats / summary output files |
| `-Z <n>` | inferred from `--dbblock` | Effective database size for E-value calculation |
| `--cpu <n>` | 1 | Worker threads |
| `--dbblock <n>` | 10000 | Sequences per database block |
| `--test_limit <n>` | 500 | Minimum test sequences before stopping |
| `--val_limit <n>` | 100 | Minimum validation sequences |
| `--init_chunk <n>` | 50 | Sequences considered per assignment round |
| `--split-seed <n>` | 0 | RNG seed for train/test/val assignment (0 = random each run) |
| `--seed <n>` | 42 | RNG seed for internal pHMMER pipeline (0 = arbitrary) |
| `--suppress` | off | Disable progress bar |
| `--task_id <id>` | 0 | Suffix for output files (`*_0.fasta`, etc.) |
| `--output_dir <dir>` | - | Write train/test/val/discard under `<dir>` |
| `-E <x>` | 0.01 | E-value threshold for significant hits |
| `--plow`, `--phigh` | 0.0 | PID window for accepting sequences |
For all options: chisel_splitter -h.
phmmer_filter is a standalone pHMMER-based pairwise filter. It is used internally by chisel_filter and chisel_dedup.
| Argument | Description |
|---|---|
| `<qdb>` | Query sequence database (first positional) |
| `<tdb>` | Target sequence database (second positional) |
One of qdb or tdb may be - (stdin), not both.
| Option | Default | Description |
|---|---|---|
| `-o <prefix>` | - | Output prefix for result files |
| `-Z <n>` | inferred | Database size for E-value calibration |
| `-E <x>` | 0.01 | Reporting E-value threshold |
| `--cpu <n>` | 1 | Threads |
| `--qsize <n>` | 1 | Queries per thread per batch |
| `--format <n>` | 1 | Output format (see below) |
| `--all_hits` | off | Report all hits, not just first failure |
| `--no_self` | off | Ignore self-comparison (used by `chisel_dedup`) |
| `--plow`, `--phigh` | 0.0 | PID limits for accepting sequences |
| `--seed <n>` | 42 | RNG seed for internal pHMMER pipeline (0 = arbitrary) |
| Value | Meaning |
|---|---|
| 0 | Per-sequence ACCEPT/REJECT string |
| 1 | Full information for rejected hits (default) |
| 2 | IDs of accepted sequences |
| 3 | IDs of rejected queries |
| 4 | IDs of rejected targets (with `--all_hits`) |
For all options: phmmer_filter -h.
- Benchmark test and validation sets: run `chisel_build` with tuned `SPLIT_TEST_LIMIT`, `SPLIT_VAL_LIMIT`, and `SPLIT_CPU` to sample and de-overlap test/val from the head of a database. Then use `chisel_filter` on remaining sequences to grow the training set.
- Filter training candidates against a fixed test set: `chisel_filter` with `REMOVE_TARGET=db` and strict `E_VALUE` / `Z_SIZE`.
- Remove overlap between two FASTA sets: point `--fixed-file` and `--db-file` at the two pools; set `REMOVE_TARGET` to drop hits from either side.
- Out-of-distribution evaluation: use `chisel_build` for held-out test/val, then `chisel_filter` or `phmmer_filter` to strip homologs from training candidates.
- Within-set deduplication: `chisel_dedup` on a FASTA file, or `chisel_filter` / `phmmer_filter` with the same file as both query and target.
    chisel_splitter -h
    phmmer_filter -h
    chisel_build -h
    chisel_filter -h
    chisel_dedup -h

(-h and --help are equivalent for the pipeline scripts.)
make install-external builds FASTA36 from upstream wrpearson/fasta36 with a GCC compatibility patch (install/patches/fasta36-gcc-prototypes.patch). Output: external_tools/fasta36/bin/ssearch36.
Optional overrides: FASTA36_REPO, FASTA36_REF (see install/fasta36_install.sh).
Upstream FASTA36 predates strict ISO C defaults on recent GCC/Clang. Legacy mode requires git, patch, and make on PATH. If upstream changes the patched files, regenerate the diff and retry make install-external.
1. Eddy SR (2011) Accelerated Profile HMM Searches. PLOS Computational Biology 7(10): e1002195. https://doi.org/10.1371/journal.pcbi.1002195
This project is licensed under the MIT License. See LICENSE for the full text.