Running MiST
MiST is a command-line tool used for allele calling and (cg)MLST profiling. It compares input assemblies or contigs in FASTA format to an indexed (cg)MLST database and reports the best-matching alleles and profiles, as well as potential novel alleles.
The only required options are --db and --fasta.
mist call --db neisseria/mlst --fasta input_contigs.fasta

mist call \
    --db neisseria/mlst \
    --fasta input_contigs.fasta \
    --out-tsv alleles.tsv \
    --out-dir results \
    --threads 8

The following options are available:
Usage: mist call [OPTIONS]
Calls alleles from a FASTA file.
Options:
-f, --fasta FILE Input FASTA path [required]
-d, --db PATH Database path [required]
-o, --out-json PATH JSON output file
-t, --threads INTEGER Number of threads to use [default: 1]
--out-tsv PATH TSV output file
--out-dir PATH Output directory
--export-novel Create FASTA files for (potential) novel alleles
--keep-minimap2 Store the minimap2 output
--min-id-novel INTEGER Minimum % identity for novel alleles [default: 99]
-m, --multi [all|first|longest]
Strategy to handle multiple perfect hits [default: all]
--loci TEXT Limit to these loci, provided as a comma-separated string (e.g., 'abcZ,fumC')
--debug Enable debug mode
--log PATH Save log to this file
--help Show this message and exit
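When several assemblies need to be typed against the same database, the call can be scripted. The sketch below is one possible wrapper, not part of MiST itself: it uses only the options listed above (`--fasta`, `--db`, `--out-json`, `--threads`), while the helper names `build_mist_command` and `call_all`, the `*.fasta` file pattern, and the one-JSON-per-input naming are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_mist_command(fasta, db, out_json, threads=1):
    """Assemble a `mist call` argument list from documented options only."""
    return [
        "mist", "call",
        "--fasta", str(fasta),
        "--db", str(db),
        "--out-json", str(out_json),
        "--threads", str(threads),
    ]


def call_all(fasta_dir, db, out_dir, threads=1):
    """Run `mist call` once per *.fasta file in fasta_dir (assumed layout)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(fasta_dir).glob("*.fasta")):
        # Hypothetical naming scheme: one JSON file per input assembly.
        out_json = out_dir / f"{fasta.stem}.json"
        # check=True raises CalledProcessError if mist exits with an error.
        subprocess.run(build_mist_command(fasta, db, out_json, threads), check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.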
By default, the output will be generated in JSON format and stored in mist.json.
The JSON output contains three main sections:
- alleles: dictionary with allele calls for each locus
- profile: best matching (cg)ST profile and metadata (including % of matching loci)
- metadata: analysis metadata (timestamp, tool version, etc.)
If the --out-tsv option is set, an additional TSV file is generated with the following columns:
- locus: target locus
- allele: detected allele
- length: length of the detected sequence
- contig: contig on which the locus was detected
- start: start position in the contig
- end: end position in the contig
- strand: strand on which the locus was detected
- is_novel: whether the allele is novel (boolean)
- closest_alleles: closest alleles (only for novel alleles)
- tags: tags (see below)
If multiple matches are found, the corresponding values are listed, separated by semicolons (;).
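The TSV can be read with any tab-aware parser. The following is a minimal sketch, assuming the file starts with a header row that uses the column names listed above. The example row is invented for illustration (including the way `is_novel` is written), and the helpers `read_allele_table` and `split_values` are hypothetical names, not part of MiST.

```python
import csv
import io


def read_allele_table(handle):
    """Parse the TSV into a list of dicts, one per row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def split_values(field):
    """Split a semicolon-separated field into a list; an empty field gives []."""
    return [value for value in field.split(";") if value]


# Invented example content; the exact formatting of a real file may differ.
EXAMPLE_TSV = (
    "locus\tallele\tlength\tcontig\tstart\tend\tstrand\tis_novel\tclosest_alleles\ttags\n"
    "abcZ\t2;7\t433;433\tcontig_1;contig_4\t100;2500\t533;2933\t+;-\tFalse\t\tMULTI\n"
)

rows = read_allele_table(io.StringIO(EXAMPLE_TSV))
for row in rows:
    print(row["locus"], split_values(row["allele"]), split_values(row["contig"]))
```

For a file on disk, `open("alleles.tsv", newline="")` can be passed in place of the `io.StringIO` object.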
If the --out-dir option is set, any additional output files that are enabled are stored in this directory:
results/
├── mist.log                 # Log file
├── minimap2_parsed.tsv      # Parsed minimap2 alignments (if --keep-minimap2 is set)
└── novel_alleles/           # FASTA files of novel alleles
    ├── gdh_n462f8f.fasta    # The name corresponds to the locus name followed by the sequence hash
    └── ...
Example output for a perfect hit of the pdhC locus:
{
    "allele_str": "3",
    "allele_results": [
        {
            "allele": "3",
            "alignment": {
                "seq_id": "gi|77358697|ref|NC_003112.2|",
                "start": 1360856,
                "end": 1361335,
                "strand": "+"
            },
            "sequence": null,
            "closest_alleles": null
        }
    ],
    "tags": []
}

The dictionary contains the following entries:

- allele_str: detected allele as a string
- allele_results: all alignments
- tags: additional tags to denote special cases or missing alleles. An overview of the available tags is provided in the table below.
| Tag | Description |
|---|---|
| ABSENT | The locus is likely absent, as no seed alignment was found. |
| EDGE | The detected allele is located at the end of a contig and is therefore incomplete. |
| EXACT | Exact match to the sequence in the database. |
| INDEL | The locus is present, but the allele length does not match any known sequence in the database. |
| MULTI | Multiple exact matches are found (see also the --multi parameter). |
| NOVEL | A potential novel allele has been detected that is not present in the current database. |
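Tags make it easy to pick out loci that need a closer look. The sketch below groups loci by tag, assuming the `alleles` section of the JSON output maps each locus name to a dictionary of the form shown above. The helper `loci_by_tag` is a hypothetical name, the first two example entries are trimmed versions of the calls shown in this document, and the `aroE` entry is invented for illustration.

```python
from collections import defaultdict


def loci_by_tag(alleles):
    """Map each tag to the sorted list of loci that carry it."""
    grouped = defaultdict(list)
    for locus, call in alleles.items():
        for tag in call.get("tags", []):
            grouped[tag].append(locus)
    return {tag: sorted(loci) for tag, loci in grouped.items()}


# In practice this would come from the JSON output, e.g.:
#   import json
#   with open("mist.json") as handle:
#       alleles = json.load(handle)["alleles"]
example_alleles = {
    "pdhC": {"allele_str": "3", "tags": []},
    "gdh": {"allele_str": "*462f", "tags": ["NOVEL"]},
    "aroE": {"tags": ["ABSENT"]},  # invented entry, other fields omitted
}

for tag, loci in loci_by_tag(example_alleles).items():
    print(tag, loci)
```

Loci without any tag, such as `pdhC` here, do not appear in the result.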
If a novel allele is detected, the corresponding sequence is included in the JSON output and written to a FASTA file (if --export-novel is set).
Alleles flagged as potential novel alleles are assigned a unique hash derived from their full sequence. The hash is generated using the SHA-1 algorithm applied to the lowercase version of the sequence in the forward-strand orientation.
📌 Disclaimer: It is strongly recommended to submit the valid novel alleles to the underlying databases.
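The hashing rule described above can be sketched with Python's standard `hashlib` module. This is an illustration of the stated rule (SHA-1 over the lowercase sequence), not MiST's actual code: the function name `novel_allele_hash` is hypothetical, the input is assumed to be already in forward-strand orientation, and how many characters of the digest end up in the allele label or file name is an assumption not confirmed here.

```python
import hashlib


def novel_allele_hash(forward_sequence):
    """Return the SHA-1 hex digest of the lowercased sequence.

    The caller is assumed to pass the sequence in forward-strand orientation.
    """
    normalised = forward_sequence.lower()
    return hashlib.sha1(normalised.encode("ascii")).hexdigest()


digest = novel_allele_hash("ATGTTCGAGCCGCTGTGGAACAATAA")
# Assumption: labels such as "*462f" use a short prefix of the full digest.
short_label = "*" + digest[:4]
print(digest, short_label)
```

Because the sequence is lowercased before hashing, the same allele yields the same hash regardless of the letter case used in the input FASTA.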
Example (sequence truncated for clarity):
{
    "allele_str": "*462f",
    "allele_results": [
        {
            "allele": "*462f",
            "alignment": {
                "seq_id": "gi|77358697|ref|NC_003112.2|",
                "start": 1419413,
                "end": 1419913,
                "strand": "-"
            },
            "sequence": "ATGTTCGAGCCGCTGTGGAACAATAA...",
            "closest_alleles": [
                "gdh_5",
                "gdh_67"
            ]
        }
    ],
    "tags": [
        "NOVEL"
    ]
}