Indexing schemes
MiST requires an indexed scheme to run queries. An index is created from a set of FASTA files containing allele sequences. Optionally, a profiles file (e.g., MLST or cgMLST profiles in TSV format) can also be provided to link allele combinations with sequence types.
FASTA input can be provided in two ways:
- As command line arguments (one or more FASTA files, supports wildcards)
- Via a text file listing FASTA file paths with --fasta-list
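A list file for --fasta-list can be generated rather than written by hand. The sketch below is one possible way to do it, assuming the list holds one FASTA path per line (the exact format is not stated above). The function name `write_fasta_list` is hypothetical and not part of MiST.

```python
from pathlib import Path


def write_fasta_list(fasta_dir, list_path, pattern="*.fasta"):
    """Write the FASTA files found in fasta_dir to list_path.

    Assumption: the list file contains one path per line.
    Returns the number of paths written.
    """
    # Sort the matches so the file order is reproducible.
    paths = sorted(Path(fasta_dir).glob(pattern))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)
```

The resulting file could then be passed as `mist index --fasta-list fasta_files.txt -o mlst_neisseria`.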
Each FASTA file must contain the sequences for a single locus. Sequence headers should use one of these formats:
>{locus_name}_{allele_id} (e.g., abcZ_1, abcZ_2, ...)
>{allele_id} (e.g., 1, 2, 3, ...)
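Headers can be checked against these two formats before indexing. The following is a minimal sketch, not MiST's own validation: it assumes allele IDs are integers (as in the examples above) and that the locus name is known to the caller, for instance from the file name. The helper `invalid_headers` is hypothetical.

```python
import re


def invalid_headers(lines, locus):
    """Return the header lines that match neither accepted format.

    Accepted: '>{locus}_{allele_id}' or '>{allele_id}'.
    Assumption: allele IDs are integers, as in the examples.
    """
    pattern = re.compile(rf">(?:{re.escape(locus)}_)?\d+")
    bad = []
    for line in lines:
        header = line.strip()
        # Only header lines are checked; sequence lines are skipped.
        if header.startswith(">") and not pattern.fullmatch(header):
            bad.append(header)
    return bad
```

An empty result means every header in the file follows one of the two formats for the given locus.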
Profiles must be provided in TSV format. The loci specified in the profiles must match the corresponding FASTA files. Additional metadata (e.g., clonal complex) can be included in extra columns.
ST dnaA fusA gyrB leuS pyrG rplB rpoB clonal_complex
1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4
5 5 5 2 5 5 5 5
...
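Because the loci in the profiles must match the FASTA files, a quick pre-flight check can catch mismatches early. This sketch assumes that profile column names correspond to FASTA file names without their extension; how MiST itself performs the matching is not described here. The function `missing_loci` is hypothetical. Columns with no FASTA file (such as ST or clonal_complex) are treated as metadata and ignored.

```python
import csv
from pathlib import Path


def missing_loci(profiles_text, fasta_paths):
    """Return FASTA loci that have no matching column in the profiles header.

    Assumption: a locus is named after its FASTA file stem,
    e.g. 'dnaA.fasta' corresponds to the 'dnaA' column.
    """
    # Only the first (header) row of the TSV is needed.
    header = next(csv.reader(profiles_text.splitlines(), delimiter="\t"))
    columns = set(header)
    loci = [Path(p).stem for p in fasta_paths]
    return [locus for locus in loci if locus not in columns]
```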
mist index abcZ.fasta adk.fasta aroE.fasta fumC.fasta gdh.fasta -o mlst_neisseria
mist index *.fasta --profiles profiles.tsv -o mlst_neisseria
mist index --fasta-list fasta_files.txt --profiles profiles.tsv -o mlst_neisseria --threads 16

-l, --fasta-list PATH   List with input FASTA path(s)
-p, --profiles PATH     TSV file with profiles
-o, --output PATH       Output directory [required]
-c, --cutoff INTEGER    Clustering cutoff [default: 95]
-t, --threads INTEGER   Number of threads to use [default: 1]
--debug                 Enable debug mode
--log PATH              Save log to this file
--help                  Show this message and exit
By default, sequences are clustered at 95% identity. This threshold ensures accurate allele calling while reducing redundancy.
- Lowering the cutoff (minimum 80%) can speed up indexing and querying
- However, reducing the cutoff may decrease allele calling accuracy
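When indexing is scripted, the cutoff can be validated before the command is launched. The sketch below builds an argument list using only the options listed above. The lower bound of 80 comes from the text; the upper bound of 100 is an assumption based on the value being a percentage. The function `build_index_command` is hypothetical.

```python
def build_index_command(fasta_files, output, cutoff=95, threads=1, profiles=None):
    """Build the argument list for a 'mist index' call.

    Assumption: valid cutoffs range from 80 to 100 (inclusive).
    """
    if not 80 <= cutoff <= 100:
        raise ValueError(f"cutoff must be between 80 and 100, got {cutoff}")
    cmd = ["mist", "index", *fasta_files,
           "-o", output, "-c", str(cutoff), "-t", str(threads)]
    # The profiles file is optional.
    if profiles is not None:
        cmd += ["-p", profiles]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)` to run the indexing step.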
The output directory contains one subdirectory per locus, along with several reference files.
mlst_neisseria/
├── abcZ/
├── adk/
├── aroE/
├── fumC/
├── gdh/ # One directory per locus
├── pdhC/
├── pgm/
├── loci_repr.fasta # Representative allele sequences
├── loci_repr.fasta.mni # Minimap2 index of representative alleles
└── loci.txt # List of all locus names
Within each locus directory:
abcZ/
├── abcZ.fasta # All allele sequences
├── abcZ-clustered.fasta # Representative sequences per cluster
├── abcZ-clustered.fasta.clstr # CD-HIT cluster assignments
└── mist_db.json # MiST database file
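The layout above can be used to sanity-check an index after it has been built. This is a rough sketch under two assumptions: loci.txt contains one locus name per line, and the file naming shown for abcZ applies to every locus. The function `missing_index_files` is hypothetical and not part of MiST.

```python
from pathlib import Path


def missing_index_files(index_dir):
    """Return expected per-locus files that are absent from the index.

    Paths are returned relative to index_dir, e.g. 'adk/mist_db.json'.
    """
    root = Path(index_dir)
    # Assumption: loci.txt lists one locus name per line.
    text = (root / "loci.txt").read_text()
    loci = [line.strip() for line in text.splitlines() if line.strip()]
    missing = []
    for locus in loci:
        # Assumption: every locus follows the naming shown for abcZ.
        expected = [
            f"{locus}.fasta",
            f"{locus}-clustered.fasta",
            f"{locus}-clustered.fasta.clstr",
            "mist_db.json",
        ]
        for name in expected:
            rel = Path(locus) / name
            if not (root / rel).is_file():
                missing.append(rel.as_posix())
    return missing
```

An empty list suggests that every locus named in loci.txt has a complete directory.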