STEAM — Search with TEA against Many

STEAM is heavily adapted from Foldseek (van Kempen et al., Nature Biotechnology 2024), replacing Foldseek's 3Di structural alphabet with TEA. Because TEA sequences are generated from amino acid sequences alone, STEAM can be applied to any protein sequence; no 3D structure is required. Like Foldseek, STEAM is built on the MMseqs2 framework.

Requirements

  • CMake >= 3.15
  • GCC >= 7 or Clang
  • For TEA sequence generation: TEA (pip install git+https://github.com/PickyBinders/tea.git)

Installation

# Install build dependencies (if needed)
mamba install -c conda-forge cmake gxx_linux-64

# Build
git clone --recursive https://github.com/PickyBinders/steam.git
cd steam
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j4

The binary will be at build/src/steam.

Quick start

1. Generate TEA sequences

Convert amino acid sequences to the TEA structural alphabet using the tea_convert tool:

tea_convert -f proteins.fasta -o proteins_tea.fasta

This requires a GPU and the TEA package. The output is a FASTA file with TEA sequences in the same order as the input.
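
Because the later steps take TEA and AA FASTA files as pairs, it can be worth checking that the two files line up before searching. The following is a minimal sketch, assuming that tea_convert keeps the FASTA headers unchanged (the README only states that the order is preserved); fasta_ids and check_paired are hypothetical helper names, not part of STEAM or TEA.

```python
def fasta_ids(lines):
    """Return record identifiers (first word of each '>' header) in input order."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def check_paired(tea_lines, aa_lines):
    """Raise ValueError unless both inputs list the same IDs in the same order.

    Assumption: headers are identical in the TEA and AA files.
    Returns the number of paired records on success.
    """
    tea_ids, aa_ids = fasta_ids(tea_lines), fasta_ids(aa_lines)
    if len(tea_ids) != len(aa_ids):
        raise ValueError(f"{len(tea_ids)} TEA records vs {len(aa_ids)} AA records")
    for index, (tea_id, aa_id) in enumerate(zip(tea_ids, aa_ids)):
        if tea_id != aa_id:
            raise ValueError(f"record {index}: {tea_id!r} != {aa_id!r}")
    return len(tea_ids)
```

Both functions accept any iterable of lines, so open file handles can be passed directly, for example check_paired(open("proteins_tea.fasta"), open("proteins.fasta")).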

2. Search

steam easy-search query_tea.fasta query_aa.fasta \
                   target_tea.fasta target_aa.fasta \
                   result.m8 tmp

Useful flags:

| Flag | Default | Notes |
|---|---|---|
| -e | 100 | E-value threshold |
| --max-seqs | 2000 | Maximum results per query from prefiltering |

3. Cluster

easy-cluster runs cascaded clustering (sensitive) and easy-linclust runs linear-time clustering (faster, less sensitive). Both take paired TEA/AA FASTAs:

# Cascaded clustering
steam easy-cluster proteins_tea.fasta proteins_aa.fasta clusterResult tmp

# Linear-time clustering (large datasets)
steam easy-linclust proteins_tea.fasta proteins_aa.fasta clusterResult tmp

Both commands write three output files, named with the clusterResult prefix:

  • clusterResult_cluster.tsv: <representative> <member> adjacency list
  • clusterResult_rep_seq.fasta: one AA sequence per cluster representative
  • clusterResult_all_seqs.fasta: FASTA grouped by cluster
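
The adjacency list can be turned into a representative-to-members mapping with a few lines of Python. This is a sketch only: it assumes the two columns of clusterResult_cluster.tsv are tab-separated (suggested by the .tsv extension), and read_clusters is a hypothetical helper name.

```python
from collections import defaultdict


def read_clusters(lines):
    """Group members by representative from '<representative>\t<member>' lines."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumption: exactly two tab-separated columns per line.
        representative, member = line.split("\t")[:2]
        clusters[representative].append(member)
    return dict(clusters)
```

For example, read_clusters(open("clusterResult_cluster.tsv")) returns a dictionary whose keys are representatives and whose values are lists of member identifiers.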

Useful flags:

| Flag | Default | Notes |
|---|---|---|
| --min-seq-id | 0 | Minimum sequence identity for cluster members |
| -c | 0.8 | Minimum coverage |
| --cov-mode | 0 | 0=bidirectional, 1=target, 2=query |
| --cluster-reassign | off | Cascaded only: corrects criteria-violations from cascaded merging |
| --single-step-cluster | off | Cascaded only: skip cascading, single pass |

Commands

| Command | Description |
|---|---|
| easy-search | Search FASTA pairs against FASTA pairs or a pre-built database |
| easy-cluster | Cluster paired TEA/AA FASTAs (cascaded, sensitive) |
| easy-linclust | Cluster paired TEA/AA FASTAs (linear-time, faster) |
| createdb | Create a STEAM database from paired TEA/AA FASTA files |
| search | Search pre-built databases (faster for repeated searches) |
| cluster | Cluster a pre-built database (cascaded) |
| linclust | Cluster a pre-built database (linear-time) |
| convertalis | Convert alignment results to various output formats |
| createsubdb | Subset a STEAM database (keeps _aa companion in sync) |

Database workflow

For searching or clustering the same database multiple times, pre-build it:

# Create database (one time)
steam createdb target_tea.fasta target_aa.fasta targetDB

# Search against pre-built database (fast, repeatable)
steam easy-search query_tea.fasta query_aa.fasta targetDB result.m8 tmp

# Cluster the pre-built database
steam cluster targetDB clusterDB tmp

Output format

Default BLAST-tab format (same as MMseqs2/BLAST -outfmt 6):

query  target  fident  alnlen  mismatch  gapopen  qstart  qend  tstart  tend  evalue  bits

Custom output with --format-output adds TEA-specific output columns:

| Column | Description |
|---|---|
| tfident | TEA fractional identity |
| tpident | TEA percent identity |
| qteaseq | Query TEA full sequence |
| tteaseq | Target TEA full sequence |
| qteaaln | Query TEA aligned sequence |
| tteaaln | Target TEA aligned sequence |

Standard MMseqs2 output columns (fident, alnlen, qcov, tcov, evalue, raw, bits, etc.) are also available.
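
For downstream analysis, the tab-separated result file can be read into dictionaries keyed by column name. The sketch below assumes the default 12-column layout listed above; for custom output, pass the same column names, in the same order, that were given to --format-output. read_hits and filter_by_evalue are hypothetical helper names.

```python
import csv

# Column order of the default BLAST-tab output described above.
DEFAULT_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]


def read_hits(lines, columns=DEFAULT_COLUMNS):
    """Yield one dict per hit; all values are kept as strings."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield dict(zip(columns, row))


def filter_by_evalue(hits, max_evalue):
    """Keep hits whose E-value is at most max_evalue."""
    return [hit for hit in hits if float(hit["evalue"]) <= max_evalue]
```

For example, filter_by_evalue(read_hits(open("result.m8")), 1e-3) returns only the hits with an E-value of 0.001 or lower.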

Scoring

The alignment score at each position is the sum of:

  • MATCHA score: substitution score from the TEA alphabet matrix
  • AA score: BLOSUM62 substitution score, weighted by --aa-weight (default 1.4)
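
The combination described above can be written out as a short sketch. It is illustrative only: the substitution scores are passed in as plain numbers rather than looked up in the actual matrices, and ungapped_score is an added assumption that simply sums positions without modelling gap penalties, which this README does not describe. Both function names are hypothetical.

```python
def position_score(tea_substitution, aa_substitution, aa_weight=1.4):
    """TEA matrix score plus the weighted BLOSUM62 score for one aligned position."""
    return tea_substitution + aa_weight * aa_substitution


def ungapped_score(tea_scores, aa_scores, aa_weight=1.4):
    """Sum position scores over an ungapped segment (assumption: no gap handling)."""
    if len(tea_scores) != len(aa_scores):
        raise ValueError("TEA and AA score lists must have the same length")
    return sum(
        position_score(tea, aa, aa_weight)
        for tea, aa in zip(tea_scores, aa_scores)
    )
```

Setting aa_weight to 0 reduces the score to the TEA term alone, which makes the contribution of the amino acid term easy to inspect.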

E-value computation

STEAM uses a log-linear E-value model following Edgar & Sahakyan (2025). E-values are computed as:

E(s) = (H/Q) * 10^(m*s + c)

where s is the raw alignment score, H/Q is the average number of reported hits per query (computed at runtime from prefilter results), and m and c are parameters fitted on SCOP40c.
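
As a worked illustration of the formula, the sketch below evaluates E(s) for given parameters. The fitted values of m and c are not listed in this README, so the numbers in the usage note are placeholders chosen only to make the arithmetic easy to follow; hits_per_query and evalue are hypothetical helper names.

```python
def hits_per_query(total_hits, num_queries):
    """H/Q: average number of reported hits per query."""
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    return total_hits / num_queries


def evalue(raw_score, h_over_q, m, c):
    """E(s) = (H/Q) * 10^(m*s + c), with m and c supplied by the caller."""
    return h_over_q * 10.0 ** (m * raw_score + c)
```

With placeholder parameters m = -0.1 and c = 2.0, a raw score of 30 and H/Q = 10 give an exponent of -1, so the E-value is 10 * 10^-1, or about 1. A negative m means that higher raw scores produce smaller E-values.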
