Indexing schemes
MiST requires an indexed scheme to run queries. An index is created from a set of FASTA files containing allele sequences. Optionally, a profiles file (e.g., MLST or cgMLST profiles in TSV format) can also be provided to link allele combinations with sequence types.
FASTA input can be provided in two ways:
- Directly with --fasta (one or more FASTA files, supports wildcards)
- Via a text file listing FASTA file paths with --fasta-list
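A list file for --fasta-list can be generated rather than written by hand. The sketch below is a hypothetical helper (make_fasta_list is not part of MiST) and assumes the list file holds one FASTA path per line, which the text above does not state explicitly.

```shell
# Hypothetical helper: write the paths of all *.fasta files in a directory
# to a list file. Assumption: --fasta-list expects one path per line.
make_fasta_list() {
    fasta_dir="$1"
    list_file="$2"
    # find prints one path per line; sort keeps the order reproducible
    find "$fasta_dir" -maxdepth 1 -type f -name '*.fasta' | sort > "$list_file"
}
```

Calling make_fasta_list alleles fasta_files.txt would produce a file that can then be passed as --fasta-list fasta_files.txt.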
Each FASTA file must contain the sequences for a single locus. Sequence headers should use one of these formats:
>{locus_name}_{allele_id} (e.g., abcZ_1, abcZ_2, ...)
>{allele_id} (e.g., 1, 2, 3, ...)
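Headers can be checked against these two formats before indexing. The following is a minimal sketch, not part of MiST: check_headers is a hypothetical helper, and it assumes that the locus name equals the file name without the .fasta extension and that allele IDs are integers, as in the examples above.

```shell
# Hypothetical helper: print headers that match neither documented format.
# Assumptions: locus name = file name without ".fasta"; allele IDs are integers.
# Returns 0 if every header is valid, 1 otherwise.
check_headers() {
    fasta="$1"
    locus=$(basename "$fasta" .fasta)
    awk -v locus="$locus" '
        /^>/ {
            id = substr($1, 2)                  # first header token without ">"
            prefix = locus "_"
            # Strip the optional "{locus_name}_" prefix
            if (index(id, prefix) == 1) id = substr(id, length(prefix) + 1)
            if (id !~ /^[0-9]+$/) { print FILENAME ": " $0; bad = 1 }
        }
        END { exit bad + 0 }
    ' "$fasta"
}
```

Running check_headers abcZ.fasta prints nothing when all headers look like >abcZ_1 or >1, and lists the offending header lines otherwise.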
Profiles must be provided in TSV format. The loci specified in the profiles must match the corresponding FASTA files. Additional metadata (e.g., clonal complex) can be included in extra columns.
ST dnaA fusA gyrB leuS pyrG rplB rpoB clonal_complex
1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4
5 5 5 2 5 5 5 5
...
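Because the loci in the profiles must match the FASTA files, it can help to compare the two before indexing. The sketch below uses a hypothetical helper (columns_without_fasta, not part of MiST) and assumes that the first column holds the sequence type and that each locus is stored as {locus}.fasta in a single directory. Metadata columns such as clonal_complex are expected to show up in its output, so any other name it prints is a likely mismatch.

```shell
# Hypothetical helper: list profile columns that have no matching FASTA file.
# Assumptions: column 1 is the sequence type; loci are stored as
# {locus}.fasta inside fasta_dir.
columns_without_fasta() {
    profiles="$1"
    fasta_dir="$2"
    # Take the header row, drop a possible carriage return, skip column 1
    head -n 1 "$profiles" | tr -d '\r' | tr '\t' '\n' | tail -n +2 |
    while IFS= read -r column || [ -n "$column" ]; do
        [ -f "$fasta_dir/$column.fasta" ] || printf '%s\n' "$column"
    done
}
```

For the table above, columns_without_fasta profiles.tsv . would print only clonal_complex if all seven locus files are present in the current directory.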
mist_index --fasta abcZ.fasta adk.fasta aroE.fasta fumC.fasta gdh.fasta -o mlst_neisseria
mist_index --fasta *.fasta --profiles profiles.tsv -o mlst_neisseria
mist_index --fasta-list fasta_files.txt --profiles profiles.tsv -o mlst_neisseria --threads 16

  -h, --help            Show this help message and exit
  -f FASTA [FASTA ...], --fasta FASTA [FASTA ...]
                        Input FASTA path(s)
  -l FASTA_LIST, --fasta-list FASTA_LIST
                        List with input FASTA path(s)
  -p PROFILES, --profiles PROFILES
                        TSV file with profiles
  -o OUTPUT, --output OUTPUT
                        Output directory
  -c CUTOFF, --cutoff CUTOFF
                        Clustering cutoff
  -t THREADS, --threads THREADS
                        Nb. of threads to use
  --log                 Save log to 'mist.log' in the output directory
  --debug               Enable debug logging
  --version             Print version and exit
By default, sequences are clustered at 95% identity. This threshold ensures accurate allele calling while reducing redundancy.
- Lowering the cutoff (minimum 80%) can speed up indexing and querying
- However, reducing the cutoff may decrease allele calling accuracy
The output directory contains one subdirectory per locus, along with several reference files.
mlst_neisseria/
├── abcZ/
├── adk/
├── aroE/
├── fumC/
├── gdh/ # One directory per locus
├── pdhC/
├── pgm/
├── loci_repr.fasta # Representative allele sequences
├── loci_repr.fasta.mni # Minimap2 index of representative alleles
├── loci.txt # List of all locus names
└── mist_db.json # Database JSON (used by MiST)
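A quick sanity check of this layout is to confirm that every locus named in loci.txt has its own subdirectory. The sketch below is a hypothetical helper (missing_loci is not part of MiST) and assumes loci.txt lists one locus name per line.

```shell
# Hypothetical helper: print loci from loci.txt that lack a subdirectory.
# Assumption: loci.txt contains one locus name per line.
missing_loci() {
    index_dir="$1"
    while IFS= read -r locus || [ -n "$locus" ]; do
        [ -n "$locus" ] || continue             # skip blank lines
        [ -d "$index_dir/$locus" ] || printf '%s\n' "$locus"
    done < "$index_dir/loci.txt"
}
```

Running missing_loci mlst_neisseria prints nothing when the index directory is complete.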
Within each locus directory:
abcZ/
├── abcZ.fasta # All allele sequences
├── abcZ-clustered.fasta # Representative sequences per cluster
├── abcZ-clustered.fasta.clstr # CD-HIT cluster assignments