Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
182 lines (127 sloc) 7.15 KB

Multiple sequence alignment file formats

Easel programs can input and output ten different multiple sequence alignment formats. There are five main formats:

format i.e. suffix
stockholm Stockholm .sto, .sth, .stk
afa aligned FASTA .afa, .afasta
clustal CLUSTAL
phylip interleaved PHYLIP .ph, .phy, .phyi
selex SELEX .slx, .selex

and five variants:

format i.e. is like: but: suffix
pfam Pfam stockholm is restricted to one block .pfam
a2m UCSC A2M, dotless afa has additional semantics for consensus columns .a2m
clustallike Clustal-like clustal has another program name on first line (e.g. MUSCLE)
phylips sequential Phylip phylip "sequential", rather than "interleaved" .phys
psiblast NCBI PSI-BLAST selex is just an alignment, has no selex annotation lines .pb

The format code is what you type to select a format in a command line option, as in --informat selex or --outformat afa. These codes are treated case-insensitively, so --informat SELEX or --outformat AFA are also fine.

How alignment file formats are guessed

Normally when you open an alignment file, an Easel-based program tries to guess its format. This saves typing and synapses when you're working at the command line.

The guesser will never misidentify the format in a way that would corrupt the input alignment or change the annotation. There are formats that are problematic to distinguish based on content alone: afa versus a2m, and phylip versus phylips.

For PHYLIP files, if no hint is available from a file suffix, the guesser will nonetheless almost always be able tell the difference and call phylip versus phylips. Pathological edge cases do exist, though, where the guesser will return an error about not being able to distinguish interleaved from sequential.

However, afa and a2m files are so easily confusable that the guesser will not try to distinguish them based on content alone. The only way to get the guesser to call a2m is on a file with an explicit .a2m suffix.

If you are doing scripted high throughput analysis on files in one of these formats, consider specifying your input file format and disabling the format guesser. The commandline option for this is usually something like --informat <fmtcode>. Alternatively, use file suffixes: .afa versus .a2m, or .ph/.phy/.phyi versus .phys to tip off the guesser.

The guesser works with the following information:

  • an initial guess based on peeking at the first line of the input
  • if the input is a file with a file name, it uses the suffix as a clue (to distinguish .a2m versus .afa, or .phyi from .phys, for example)
  • in more difficult cases, the guesser looks more deeply into the input

More specifically:

stockholm, pfam formats

If the first line starts with # STOCKHOLM: guess stockholm, unless the file suffix is .pfam, then guess pfam.

Pfam format is just Stockholm, but restricted to a single alignment block. There is no difference in the alignment or annotation, so it is harmless to read a Pfam file as Stockholm.

afa, a2m formats

If the first line starts with >: if the file suffix is .a2m, guess a2m. Otherwise, call afa.

The guesser does not autodetect a2m format unless we have a .a2m suffix on the file, even though it is usually possible to distinguish afa from a2m. In afa, the number of aligned characters is always the same but the number of upper case + dash characters can vary, whereas the opposite is true for a2m. However, it is common to have an afa format alignment that consists of all upper case and dashes:

>seq1
GGG-CCC-TT
>seq2
GG-GCC-TT-

which is also valid as a2m. Although the alignment would be the same in either format, in a2m we would infer reference consensus annotation, and in afa we wouldn't. The guesser is not allowed to risk altering either alignment or annotation. Therefore a2m input requires something affirmative like the .a2m file suffix or a --informat a2m option.

It's also worth noting that other ambiguous cases exist that imply different alignments in the two formats, as in this singularly terrifying example:

this input:    means in AFA:    means in A2M:
>seq1          seq1 AAAcAA      seq1 A.AAcAA 
AAAcAA         seq2 AcAAAA      seq2 AcAA.AA
>seq2         
AcAAAA

clustal, clustallike formats

If the first line of the input starts with CLUSTAL, guess clustal. If the first line contains the phrase multiple sequence alignment, guess clustallike. The file suffix doesn't matter.

Clustal and Clustal-like formats are parsed identically. The only difference is the name of the program on the first line.

phylip, phylips formats

If the first line of the input starts with two integers, assume that they are nseq and alen, the number of sequences and number of alignment columns for a Phylip-format alignment that follows. If we have a suffix and it is .ph, .phy, or .phyi, guess phylip; if it is .phys, guess phylips. In both cases, the name width is assumed to be the Phylip standard 10.

Otherwise the guesser then looks deeper into the input to distinguish interleaved from sequential variants of the format, and to check whether the input is using the standard 10-character Phylip name width or a noncanonical width:

  • If the file is consistent with interleaved format, it is called phylip format. The standard 10 character namewidth is tried first, and if that doesn't work, a nonstandard namewidth is determined.

  • else, if the file is consistent with sequential format, it is called phylips format. The standard 10 character namewidth is tried first; if that fails, a nonstandard namewidth is determined.

It is possible to construct pathological files that are consistent with both interleaved and sequential formats. If you're working with sequential Phylip files and you need to guarantee accuracy, use a command line option like --informat phylips.

selex, psiblast formats

If the first line of the input doesn't conform to any of the formats above, and we have a suffix .slx, guess selex; if we have a suffix .pb, guess psiblast.

Otherwise the guesser looks deeper, and tests for whether the input consistent with SELEX format; if it is, guess selex.

Because PSI-BLAST is a strict subset, any file consistent with SELEX format will be guessed to be selex; reading a psiblast file as selex is harmless. If you have a legitimate psiblast file and you want to enforce stricter parsing, use a .pb file suffix on it, or use a commandline option like --informat psiblast to bypass the guesser.

You can’t perform that action at this time.