Running MiST

BertBog edited this page Feb 18, 2026 · 5 revisions

MiST is a command-line tool used for allele calling and (cg)MLST profiling. It compares input assemblies or contigs in FASTA format to an indexed (cg)MLST database and reports the best-matching alleles and profiles, as well as potential novel alleles.

Example usage

The only required options are --db and --fasta.

Minimal example

mist call --db neisseria/mlst --fasta input_contigs.fasta

Extended example (with multiple options and 8 threads)

mist call \
  --db neisseria/mlst \
  --fasta input_contigs.fasta \
  --out-tsv alleles.tsv \
  --out-dir results \
  --threads 8
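To run the same call over many assemblies, the command can be assembled programmatically. The following is a minimal Python sketch that uses only the options listed on this page. The helper names (`build_mist_command`, `run_batch`) and the per-sample directory layout are assumptions made for illustration and are not part of MiST. This page does not state how --out-dir interacts with --out-json and --out-tsv, so the sketch omits --out-dir and passes explicit file paths instead.

```python
import subprocess
from pathlib import Path


def build_mist_command(db: str, fasta: Path, out_root: Path, threads: int = 1) -> list[str]:
    """Build the `mist call` argument list for a single FASTA file."""
    # Hypothetical layout: one sub-directory per sample, named after the FASTA file.
    sample_dir = out_root / fasta.stem
    return [
        "mist", "call",
        "--db", db,
        "--fasta", str(fasta),
        "--out-json", str(sample_dir / "mist.json"),
        "--out-tsv", str(sample_dir / "alleles.tsv"),
        "--threads", str(threads),
    ]


def run_batch(db: str, fasta_dir: Path, out_root: Path, threads: int = 1) -> None:
    """Run `mist call` once per *.fasta file found in fasta_dir."""
    for fasta in sorted(fasta_dir.glob("*.fasta")):
        # Create the output directory up front rather than relying on the tool to do so.
        (out_root / fasta.stem).mkdir(parents=True, exist_ok=True)
        subprocess.run(build_mist_command(db, fasta, out_root, threads), check=True)
```

For example, `run_batch("neisseria/mlst", Path("assemblies"), Path("results"), threads=8)` would process every FASTA file in a hypothetical `assemblies` directory and stop at the first failing call.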

Options

The following options are available:

Usage: mist call [OPTIONS]

  Calls alleles from a FASTA file.

Options:
  -f, --fasta FILE                Input FASTA path  [required]
  -d, --db PATH                   Database path  [required]
  -o, --out-json PATH             JSON output file
  -t, --threads INTEGER           Number of threads to use  [default: 1]
  --out-tsv PATH                  TSV output file
  --out-dir PATH                  Output directory
  --export-novel                  Create FASTA files for (potential) novel alleles
  --keep-minimap2                 Store the minimap2 output
  --min-id-novel INTEGER          Minimum % identity for novel alleles [default: 99]
  -m, --multi [all|first|longest]
                                  Strategy to handle multiple perfect hits [default: all]
  --loci TEXT                     Limit to these loci, provided as comma-separated string (e.g., 'abcZ,fumC')
  --debug                         Enable debug mode
  --log PATH                      Save log to this file
  --help                          Show this message and exit

Output files

JSON output

By default, the output is written in JSON format to mist.json. A different path can be set with the --out-json option.

The JSON output contains three main sections:

  • alleles: dictionary with allele calls for each locus
  • profile: best-matching (cg)ST profile and metadata (including % of matching loci)
  • metadata: analysis metadata (timestamp, tool version, etc.)

TSV output

If the --out-tsv option is set, an additional TSV file is generated with the following columns:

  • locus: target locus
  • allele: detected allele
  • length: length of the detected sequence
  • contig: contig on which the locus was detected
  • start: start position in the contig
  • end: end position in the contig
  • strand: strand on which the locus was detected
  • is_novel: whether the allele is novel (boolean)
  • closest_alleles: closest known alleles (reported only for novel alleles)
  • tags: tags (see below)

If multiple matches are found, the corresponding values are listed, separated by semicolons (;).
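The semicolon convention can be handled when reading the file. The sketch below is one possible approach in Python, assuming the TSV has a header row whose names match the column list above. The function name `read_mist_tsv` is hypothetical, and the sample rows are invented for illustration rather than taken from real MiST output. Because this page does not say which columns can hold multiple values, every column except `locus` is split.

```python
import csv
import io


def read_mist_tsv(handle):
    """Yield one dict per row; every column except `locus` becomes a list."""
    for row in csv.DictReader(handle, delimiter="\t"):
        parsed = {"locus": row["locus"]}
        for column, value in row.items():
            if column in (None, "locus"):
                continue  # skip the key column and any unnamed overflow fields
            # Empty cells become empty lists; "a;b" becomes ["a", "b"].
            parsed[column] = value.split(";") if value else []
        yield parsed


# Invented excerpt with a subset of the columns, for illustration only.
SAMPLE = (
    "locus\tallele\tcontig\tclosest_alleles\n"
    "abcZ\t2;5\tcontig_1;contig_7\t\n"
    "fumC\t3\tcontig_2\t\n"
)
rows = list(read_mist_tsv(io.StringIO(SAMPLE)))
print(rows[0]["allele"])  # ['2', '5']
```

To read a real file, pass an open handle instead, for example `open("alleles.tsv", newline="")`.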

Additional outputs

If the --out-dir option is set, additional output files are stored in this directory, depending on which options are enabled.

results/
├── mist.log              # Log file
├── minimap2_parsed.tsv   # parsed Minimap2 alignments (if --keep-minimap2 is set)
└── novel_alleles/        # FASTA files of novel alleles
    ├── gdh_n462f8f.fasta # The name corresponds to the locus name followed by the sequence hash
    └── ...

JSON output format

Example output for a perfect hit of the pdhC locus:

{
  "allele_str": "3",
  "allele_results": [
    {
      "allele": "3",
      "alignment": {
        "seq_id": "gi|77358697|ref|NC_003112.2|",
        "start": 1360856,
        "end": 1361335,
        "strand": "+"
      },
      "sequence": null,
      "closest_alleles": null
    }
  ],
  "tags": []
}

The dictionary contains the following entries:

  • allele_str: detected allele as a string
  • allele_results: all alignments
  • tags: tags that denote special cases or missing alleles (see the Tags overview below)
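As an illustration of how these entries fit together, the following Python sketch formats every alignment of a single locus entry as a compact location string. The function name `hit_locations` is hypothetical, and the input is the pdhC example shown above.

```python
import json


def hit_locations(locus_record: dict) -> list[str]:
    """Format each alignment in `allele_results` as 'seq_id:start-end(strand)'."""
    locations = []
    for result in locus_record.get("allele_results", []):
        aln = result.get("alignment") or {}
        locations.append(
            f"{aln.get('seq_id')}:{aln.get('start')}-{aln.get('end')}({aln.get('strand')})"
        )
    return locations


# The pdhC entry from the example above.
record = json.loads("""
{
  "allele_str": "3",
  "allele_results": [
    {
      "allele": "3",
      "alignment": {
        "seq_id": "gi|77358697|ref|NC_003112.2|",
        "start": 1360856,
        "end": 1361335,
        "strand": "+"
      },
      "sequence": null,
      "closest_alleles": null
    }
  ],
  "tags": []
}
""")
print(hit_locations(record))  # ['gi|77358697|ref|NC_003112.2|:1360856-1361335(+)']
```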

Tags

Tag     Description
ABSENT  The locus is likely absent, as no seed alignment was found.
EDGE    The detected allele is located at the end of a contig and is therefore incomplete.
EXACT   Exact match to the sequence in the database.
INDEL   The locus is present, but the allele length does not match any known sequence in the database.
MULTI   Multiple exact matches are found (see also the --multi parameter).
NOVEL   A potential novel allele has been detected that is not present in the current database.
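Tags lend themselves to a quick per-sample summary. The sketch below counts how often each tag occurs. It assumes that the `alleles` section of the JSON output maps each locus name to an entry with a `tags` list, as in the examples on this page. The function name `count_tags` and the sample data are hypothetical.

```python
from collections import Counter


def count_tags(alleles: dict) -> Counter:
    """Count how many loci carry each tag."""
    counts = Counter()
    for record in alleles.values():
        counts.update(record.get("tags", []))
    return counts


# Invented data for illustration; only the `tags` field is used here.
alleles = {
    "pdhC": {"tags": []},
    "gdh": {"tags": ["NOVEL"]},
    "abcZ": {"tags": ["ABSENT"]},
    "fumC": {"tags": ["NOVEL"]},
}
print(count_tags(alleles)["NOVEL"])  # 2
```

With a real result file, the input would be something like `json.load(open("mist.json"))["alleles"]`, assuming the top-level key is named as in the section list under JSON output.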

Novel alleles

If a novel allele is detected, the corresponding sequence is included in the JSON output and written to a FASTA file (if --export-novel is set).

Alleles flagged as potential novel alleles are assigned a unique hash derived from their full sequence. The hash is generated using the SHA-1 algorithm applied to the lowercase version of the sequence in the forward-strand orientation.
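The hashing step described above can be reproduced roughly as follows. This is a sketch rather than MiST's actual implementation. It assumes the sequence is already in forward-strand orientation and is plain ASCII. It returns the full 40-character hex digest, because this page does not describe how the digest is shortened into the label shown in the output. The function name `sequence_sha1` is hypothetical.

```python
import hashlib


def sequence_sha1(sequence: str) -> str:
    """Return the SHA-1 hex digest of the lowercased sequence."""
    # Lowercasing first makes the digest independent of the input's letter case.
    return hashlib.sha1(sequence.lower().encode("ascii")).hexdigest()


digest = sequence_sha1("ATGTTCGAGCCGCTG")
print(len(digest))                                  # 40
print(digest == sequence_sha1("atgttcgagccgctg"))  # True
```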

📌 Disclaimer: It is strongly recommended to submit valid novel alleles to the database from which the scheme was obtained.

Example (sequence truncated for clarity):

{
  "allele_str": "*462f",
  "allele_results": [
    {
      "allele": "*462f",
      "alignment": {
        "seq_id": "gi|77358697|ref|NC_003112.2|",
        "start": 1419413,
        "end": 1419913,
        "strand": "-"
      },
      "sequence": "ATGTTCGAGCCGCTGTGGAACAATAA...",
      "closest_alleles": [
        "gdh_5",
        "gdh_67"
      ]
    }
  ],
  "tags": [
    "NOVEL"
  ]
}
