
Indexing schemes

BertBog edited this page Sep 4, 2025 · 2 revisions

MiST requires an indexed scheme to run queries. An index is created from a set of FASTA files containing allele sequences. Optionally, a profiles file (e.g., MLST or cgMLST profiles in TSV format) can also be provided to link allele combinations with sequence types.

FASTA input can be provided in two ways:

  • Directly with --fasta (one or more FASTA files, supports wildcards)
  • Via a text file listing FASTA file paths with --fasta-list
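A list file for --fasta-list can be generated with a short script. The following is a minimal sketch that assumes the list holds one FASTA path per line, which this page does not state explicitly. The function name write_fasta_list and the default pattern are hypothetical and not part of MiST.

```python
from pathlib import Path


def write_fasta_list(fasta_dir, list_path, pattern="*.fasta"):
    """Write the matching FASTA paths to list_path, one per line.

    Assumption: --fasta-list expects one path per line.
    """
    paths = sorted(Path(fasta_dir).glob(pattern))
    # Absolute paths keep the list valid regardless of the working directory.
    Path(list_path).write_text("".join(f"{p.resolve()}\n" for p in paths))
    return paths
```

The resulting file can then be passed to mist_index with --fasta-list.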

Input format

FASTA files

Each FASTA file must contain the sequences for a single locus. Sequence headers should use one of these formats:

>{locus_name}_{allele_id}   (e.g., abcZ_1, abcZ_2, ...)
>{allele_id}                (e.g., 1, 2, 3, ...)
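The two header formats above can be checked before indexing. This sketch assumes that allele IDs are integers, as in the examples, and that the identifier is the first whitespace-delimited token of the header line. Neither is confirmed by this page, and check_headers is a hypothetical helper, not a MiST function.

```python
import re


def check_headers(fasta_path, locus):
    """Return header IDs that match neither '{locus}_{allele_id}' nor '{allele_id}'.

    Assumption: allele IDs are integers.
    """
    pattern = re.compile(rf"(?:{re.escape(locus)}_)?\d+")
    bad = []
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue  # sequence line, not a header
            tokens = line[1:].split()
            header = tokens[0] if tokens else ""
            if not pattern.fullmatch(header):
                bad.append(header)
    return bad
```

An empty list means that every header in the file fits one of the two formats under these assumptions.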

Profiles

Profiles must be provided in TSV format. The locus columns in the profiles must match the loci in the corresponding FASTA files. Additional metadata (e.g., clonal complex) can be included in extra columns.

ST      dnaA    fusA    gyrB    leuS    pyrG    rplB    rpoB    clonal_complex
1       1       1       1       1       1       1       1
2       2       2       2       2       2       2       2
3       3       3       3       3       3       3       3
4       4       4       4       4       4       4       4
5       5       5       2       5       5       5       5
...
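Whether the profile loci match the FASTA files can be checked up front. This sketch rests on several assumptions that this page does not confirm: the locus name equals the FASTA file name without its extension, the first profile column is named ST, and metadata columns are listed by the caller. compare_loci is a hypothetical helper.

```python
import csv
from pathlib import Path


def compare_loci(profiles_path, fasta_paths, id_column="ST",
                 extra_columns=("clonal_complex",)):
    """Return (loci without a FASTA file, FASTA files without a profile column).

    Assumption: each locus is named after its FASTA file stem.
    """
    with open(profiles_path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    skipped = {id_column, *extra_columns}
    profile_loci = {col for col in header if col not in skipped}
    fasta_loci = {Path(p).stem for p in fasta_paths}
    return sorted(profile_loci - fasta_loci), sorted(fasta_loci - profile_loci)
```

Two empty lists indicate that the profile columns and the FASTA files agree under these assumptions.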

Example usage

Multiple FASTA files (no profiles)

mist_index --fasta abcZ.fasta adk.fasta aroE.fasta fumC.fasta gdh.fasta -o mlst_neisseria

Wildcard FASTA input with profiles TSV

mist_index --fasta *.fasta --profiles profiles.tsv -o mlst_neisseria

FASTA list with 16 threads

mist_index --fasta-list fasta_files.txt --profiles profiles.tsv -o mlst_neisseria --threads 16

Options

-h, --help          Show this help message and exit
-f FASTA [FASTA ...], --fasta FASTA [FASTA ...]
                    Input FASTA path(s)
-l FASTA_LIST, --fasta-list FASTA_LIST
                    List with input FASTA path(s)
-p PROFILES, --profiles PROFILES
                    TSV file with profiles
-o OUTPUT, --output OUTPUT
                    Output directory
-c CUTOFF, --cutoff CUTOFF
                    Clustering cutoff
-t THREADS, --threads THREADS
                    Nb. of threads to use
--log               Save log to 'mist.log' in the output directory
--debug             Enable debug logging
--version           Print version and exit

Clustering cutoff

By default, allele sequences are clustered at 95% identity. The cutoff can be changed with -c/--cutoff. The default is intended to keep allele calling accurate while reducing redundancy among the indexed sequences.

  • Lowering the cutoff (minimum 80%) can speed up indexing and querying
  • However, reducing the cutoff may decrease allele calling accuracy

Output structure

The output directory contains one subdirectory per locus, along with several reference files.

mlst_neisseria/
├── abcZ/                 # One directory per locus
├── adk/
├── aroE/
├── fumC/
├── gdh/
├── pdhC/
├── pgm/
├── loci_repr.fasta       # Representative allele sequences
├── loci_repr.fasta.mni   # Minimap2 index of representative alleles
├── loci.txt              # List of all locus names
└── mist_db.json          # Database JSON (used by MiST)

Within each locus directory:

abcZ/
├── abcZ.fasta                  # All allele sequences
├── abcZ-clustered.fasta        # Representative sequences per cluster
└── abcZ-clustered.fasta.clstr  # CD-HIT cluster assignments
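A finished index can be sanity-checked against the layout above. This sketch checks only a subset of the listed files and assumes that loci.txt holds one locus name per line, which this page does not confirm. missing_index_files is a hypothetical helper, not part of MiST.

```python
from pathlib import Path


def missing_index_files(index_dir):
    """Return expected index files (relative paths) that are absent.

    Assumption: loci.txt lists one locus name per line.
    """
    root = Path(index_dir)
    expected = [root / name
                for name in ("loci_repr.fasta", "loci.txt", "mist_db.json")]
    loci_file = root / "loci.txt"
    if loci_file.is_file():
        for locus in loci_file.read_text().split():
            expected.append(root / locus / f"{locus}.fasta")
            expected.append(root / locus / f"{locus}-clustered.fasta")
    return [str(p.relative_to(root)) for p in expected if not p.exists()]
```

An empty list means that all checked files are present. It says nothing about whether their contents are valid.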
