Indexing schemes

Bert Bogaerts edited this page Sep 23, 2025 · 2 revisions

MiST requires an indexed scheme to run queries. An index is created from a set of FASTA files containing allele sequences. Optionally, a profiles file (e.g., MLST or cgMLST profiles in TSV format) can also be provided to link allele combinations with sequence types.

FASTA input can be provided in two ways:

  • As command line arguments (one or more FASTA files, supports wildcards)
  • Via a text file listing FASTA file paths, passed with --fasta-list
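
A list file for --fasta-list can be generated rather than written by hand. The sketch below assumes the file holds one FASTA path per line, which this page does not state explicitly. The helper name `write_fasta_list` and the `*.fasta` pattern are illustrative, not part of MiST.

```python
from pathlib import Path


def write_fasta_list(fasta_dir, list_path, pattern="*.fasta"):
    """Write the absolute path of every matching FASTA file, one per line."""
    # Sorting keeps the list stable between runs.
    paths = sorted(Path(fasta_dir).glob(pattern))
    # Assumption: the list file expects one path per line.
    Path(list_path).write_text("".join(f"{p.resolve()}\n" for p in paths))
    return paths
```

For example, `write_fasta_list("alleles", "fasta_files.txt")` would produce a file that could then be passed to `mist index --fasta-list fasta_files.txt`.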

Input format

FASTA files

Each FASTA file must contain the sequences for a single locus. Sequence headers should use one of the following formats:

>{locus_name}_{allele_id}   (e.g., abcZ_1, abcZ_2, ...)
>{allele_id}                (e.g., 1, 2, 3, ...)
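
Headers can be checked against these two formats before indexing. The following is a minimal sketch, not MiST's own validation: it assumes allele IDs consist of digits only (as in the examples above) and that only the first whitespace-delimited token of a header line counts. The function name `invalid_headers` is hypothetical.

```python
import re


def invalid_headers(fasta_text, locus):
    """Return headers that match neither {locus}_{allele_id} nor {allele_id}."""
    # Assumption: allele IDs consist of digits only, as in the examples.
    pattern = re.compile(rf"(?:{re.escape(locus)}_)?\d+")
    bad = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumption: only the first whitespace-delimited token is the ID.
            fields = line[1:].split()
            header = fields[0] if fields else ""
            if not pattern.fullmatch(header):
                bad.append(header)
    return bad
```

Called on the contents of abcZ.fasta with `locus="abcZ"`, this would flag a header such as `adk_3` while accepting `abcZ_1` and `2`.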

Profiles

Profiles must be provided in TSV format. The loci specified in the profiles must match the corresponding FASTA files. Additional metadata (e.g., clonal complex) can be included in extra columns.

ST      dnaA    fusA    gyrB    leuS    pyrG    rplB    rpoB    clonal_complex
1       1       1       1       1       1       1       1
2       2       2       2       2       2       2       2
3       3       3       3       3       3       3       3
4       4       4       4       4       4       4       4
5       5       5       2       5       5       5       5
...
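
Because the loci in the profiles must match the FASTA files, it can help to compare the TSV header against the available locus names before indexing. This sketch assumes the first column holds the sequence type and that locus names are supplied by the caller (for instance, derived from FASTA file names, which is itself an assumption). The function name `split_profile_columns` is illustrative.

```python
def split_profile_columns(header_line, fasta_loci):
    """Split the TSV header into locus columns and unmatched extra columns."""
    columns = header_line.rstrip("\n").split("\t")
    # Assumption: the first column holds the sequence type (ST) identifier.
    id_column, rest = columns[0], columns[1:]
    known = set(fasta_loci)
    # Columns with a matching FASTA locus are treated as loci.
    loci = [c for c in rest if c in known]
    # Anything else is either metadata (e.g. clonal_complex) or a typo.
    extras = [c for c in rest if c not in known]
    # Loci that have a FASTA file but no column in the profiles.
    missing = sorted(known - set(rest))
    return id_column, loci, extras, missing
```

Columns reported as extras are not necessarily errors, since metadata columns such as clonal_complex are allowed. Entries in the missing list indicate FASTA loci that the profiles do not cover.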

Example usage

Multiple FASTA files (no profiles)

mist index abcZ.fasta adk.fasta aroE.fasta fumC.fasta gdh.fasta -o mlst_neisseria 

Wildcard FASTA input with profiles TSV

mist index *.fasta --profiles profiles.tsv -o mlst_neisseria

FASTA list with 16 threads

mist index --fasta-list fasta_files.txt --profiles profiles.tsv -o mlst_neisseria --threads 16

Options

  -l, --fasta-list PATH  List with input FASTA path(s)
  -p, --profiles PATH    TSV file with profiles
  -o, --output PATH      Output directory  [required]
  -c, --cutoff INTEGER   Clustering cutoff  [default: 95]
  -t, --threads INTEGER  Number of threads to use  [default: 1]
  --debug                Enable debug mode
  --log PATH             Save log to this file
  --help                 Show this message and exit

Clustering cutoff

By default, sequences are clustered at 95% identity (--cutoff 95). This default aims to preserve allele calling accuracy while reducing redundancy.

  • Lowering the cutoff (minimum 80%) can speed up indexing and querying
  • However, reducing the cutoff may decrease allele calling accuracy
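
When indexing is scripted, the cutoff can be checked before the command runs. The sketch below assembles a `mist index` argument list using only the options listed above. The helper `build_index_command` is hypothetical, and the upper bound of 100 is an assumption; only the minimum of 80 comes from this page.

```python
def build_index_command(output, fasta_files=(), fasta_list=None,
                        profiles=None, cutoff=95, threads=1):
    """Assemble the argument list for `mist index` from documented options."""
    # The page gives 80 as the minimum; the upper bound of 100 is an assumption.
    if not 80 <= cutoff <= 100:
        raise ValueError("cutoff must be between 80 and 100")
    if not fasta_files and fasta_list is None:
        raise ValueError("provide FASTA files or a FASTA list")
    cmd = ["mist", "index", *map(str, fasta_files)]
    if fasta_list is not None:
        cmd += ["--fasta-list", str(fasta_list)]
    if profiles is not None:
        cmd += ["--profiles", str(profiles)]
    cmd += ["--output", str(output), "--cutoff", str(cutoff),
            "--threads", str(threads)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a system where `mist` is installed.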

Output structure

The output directory contains one subdirectory per locus, along with several reference files.

mlst_neisseria/
├── abcZ/                 # One directory per locus
├── adk/
├── aroE/
├── fumC/
├── gdh/
├── pdhC/
├── pgm/
├── loci_repr.fasta       # Representative allele sequences
├── loci_repr.fasta.mni   # Minimap2 index of representative alleles
└── loci.txt              # List of all locus names

Within each locus directory:

abcZ/
├── abcZ.fasta                 # All allele sequences
├── abcZ-clustered.fasta       # Representative sequences per cluster
├── abcZ-clustered.fasta.clstr # CD-HIT cluster assignments
└── mist_db.json               # MiST database file 
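
The layout above suggests a quick sanity check after indexing: every locus named in loci.txt should have a directory containing mist_db.json. This is a minimal sketch that assumes loci.txt lists one locus name per line, which this page does not confirm. The function name `loci_missing_database` is illustrative.

```python
from pathlib import Path


def loci_missing_database(index_dir):
    """Return loci from loci.txt whose directory lacks mist_db.json."""
    index_dir = Path(index_dir)
    # Assumption: loci.txt lists one locus name per line.
    lines = (index_dir / "loci.txt").read_text().splitlines()
    loci = [line.strip() for line in lines if line.strip()]
    return [locus for locus in loci
            if not (index_dir / locus / "mist_db.json").is_file()]
```

An empty list from `loci_missing_database("mlst_neisseria")` would indicate that each listed locus has its database file in place.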
