many_multiple_align.nf is a Nextflow pipeline for quickly performing many multiple sequence alignments. It can align 100+ FASTA files containing 100+ sequences each in minutes. The pipeline calls Muscle5, a widely used multiple sequence alignment program capable of running on multiple threads, once for each FASTA file found in the input directory. many_multiple_align.nf cannot be run locally on a macOS system.
git clone https://github.com/AndersenLab/many_multiple_align.git
# test
cd many_multiple_align
nextflow many_multiple_align.nf --in test_data
The Muscle5 binaries are included in the installation and can be found in the bin directory.
In addition to Nextflow, the user should have the Biopython package installed in the Python environment used to run the pipeline. The following command will install Biopython.
pip install biopython
many_multiple_align.nf requires one input directory path. The directory should contain at least one FASTA-formatted file (.fa or .fasta). The sequences in each FASTA file will be aligned, and a consensus sequence will be generated for each alignment.
nextflow many_multiple_align.nf --in <input_directory_path>
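To check an input directory before launching a run, the input requirement above can be sketched in Python. This is only an illustration: the helper name find_fasta_files is hypothetical and is not part of this repository, and it assumes that only the .fa and .fasta extensions mentioned above count as FASTA input.

```python
from pathlib import Path

# Extensions named in this README; anything else is ignored (assumption).
FASTA_SUFFIXES = {".fa", ".fasta"}


def find_fasta_files(input_dir):
    """Return the sorted FASTA file paths in input_dir (hypothetical helper)."""
    directory = Path(input_dir)
    if not directory.is_dir():
        raise NotADirectoryError(f"{input_dir} is not a directory")
    return sorted(
        path for path in directory.iterdir()
        if path.is_file() and path.suffix in FASTA_SUFFIXES
    )
```

An empty list returned for a directory would mean there is nothing for the pipeline to align.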
For smaller tasks, many_multiple_align.py can be used in place of many_multiple_align.nf to avoid unnecessarily overloading QUEST with job submissions. Unlike many_multiple_align.nf, many_multiple_align.py can be run locally on a macOS system.
python3 many_multiple_align.py <input_directory_path> # Linux OS
python3 many_multiple_align.py <input_directory_path> macintel # Mac OS Intel
python3 many_multiple_align.py <input_directory_path> macarm # Mac OS ARM (M1+)
many_multiple_align.nf and many_multiple_align.py will output a directory named mmalign_MMDDYY, where MMDDYY is replaced by the date. The structure of the output directory is given below.
mmalign_MMDDYY
├── alignment # contains multiple alignments in clustal (.aln) format
├── alignment_efa # contains multiple alignments in fasta (.efa) format
├── consensus # contains consensus sequences for each alignment
└── all_consensus.fa # single fasta file containing all consensus sequences
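When scripting around the output, the expected directory name can be computed rather than typed by hand. The sketch below assumes that MMDDYY is the zero-padded month, day, and two-digit year of the day the run was started; the function name output_dir_name is hypothetical.

```python
from datetime import date


def output_dir_name(run_date=None):
    """Build an mmalign_MMDDYY name for run_date (defaults to today)."""
    run_date = run_date or date.today()
    # %m, %d, %y give zero-padded month, day, and two-digit year.
    return "mmalign_" + run_date.strftime("%m%d%y")
```

For example, output_dir_name(date(2023, 3, 7)) gives mmalign_030723.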
By default, consensus sequences are generated by removing insertions present in <= 20% of aligned sequences. If all insertions should be included in the consensus sequences, then the following modification will call the Biopython dumb_consensus method to generate consensus sequences instead.
nextflow many_multiple_align.nf --in <input_directory_path> --cons dumb
python3 many_multiple_align.py <input_directory_path> dumb # Linux OS
python3 many_multiple_align.py <input_directory_path> macintel dumb # Mac OS Intel
python3 many_multiple_align.py <input_directory_path> macarm dumb # Mac OS ARM (M1+)
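The default rule described above (dropping insertions present in <= 20% of aligned sequences) can be illustrated with a small sketch. This is not the code used by the pipeline: the function name is hypothetical, and choosing the most common residue in each retained column is an assumption made for the example, since the README does not specify how residues are chosen.

```python
from collections import Counter


def consensus_without_rare_insertions(aligned, max_insertion_fraction=0.2, gap="-"):
    """Consensus of equal-length aligned sequences, skipping rare insertions.

    A column is skipped when the fraction of sequences carrying a residue
    (non-gap) in that column is <= max_insertion_fraction.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for column in zip(*aligned):
        residues = [char.upper() for char in column if char != gap]
        if not residues or len(residues) / len(aligned) <= max_insertion_fraction:
            continue  # rare insertion (or all-gap column): leave it out
        # Assumption: take the most common residue in the column.
        consensus.append(Counter(residues).most_common(1)[0][0])
    return "".join(consensus)
```

With five aligned sequences, a column in which only one sequence has a residue (20%) is dropped, while a column in which two sequences have a residue (40%) is kept.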
The tool clean_consensus.py will produce a new FASTA file, cleaned_all_consensus.fa, which contains all consensus sequences that are composed of < 30% ambiguous nucleotides.
python3 clean_consensus.py mmalign_MMDDYY/all_consensus.fa
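The filtering criterion can be sketched as follows. This is an illustration and not the contents of clean_consensus.py: the function names are hypothetical, and treating every character other than A, C, G, and T as ambiguous is an assumption, since the README does not define which symbols count as ambiguous.

```python
def ambiguous_fraction(sequence, unambiguous="ACGT"):
    """Fraction of characters in sequence that are not in `unambiguous`."""
    if not sequence:
        return 1.0  # treat an empty sequence as fully ambiguous (assumption)
    allowed = set(unambiguous)
    ambiguous = sum(1 for base in sequence.upper() if base not in allowed)
    return ambiguous / len(sequence)


def filter_consensus(records, max_ambiguous=0.3):
    """Keep {name: sequence} entries with < max_ambiguous ambiguous bases."""
    return {
        name: seq for name, seq in records.items()
        if ambiguous_fraction(seq) < max_ambiguous
    }
```

Under these assumptions, a 10-base consensus with two N characters (20%) is kept, while one with three (30%) is removed because the threshold is strict.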
Muscle5: Edgar, RC (2021), MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping, bioRxiv 2021.06.20.449169. https://doi.org/10.1101/2021.06.20.449169.