AndersenLab/many_multiple_align

many_multiple_align

many_multiple_align.nf is a Nextflow pipeline for quickly performing many multiple sequence alignments: it can align 100+ FASTA files containing 100+ sequences each in minutes. The pipeline calls Muscle5, a widely used multiple sequence alignment program that can run on multiple threads, on every FASTA file found in the input directory. many_multiple_align.nf cannot be run locally on a macOS system.

Installation

git clone https://github.com/AndersenLab/many_multiple_align.git

# test
cd many_multiple_align
nextflow many_multiple_align.nf --in test_data

The Muscle5 binaries are included in the installation and can be found in the bin directory.

In addition to Nextflow, the user needs the Biopython package installed in the Python environment used to run the pipeline. The following command will install Biopython.

pip install biopython

Usage

many_multiple_align.nf requires the path to one input directory. The directory should contain at least one FASTA-formatted file (.fa or .fasta). The sequences in each FASTA file will be aligned, and a consensus sequence will be generated for each alignment.

nextflow many_multiple_align.nf --in <input_directory_path>
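
The input requirement above can be checked before launching a run. The following is a minimal sketch, not code from this repository: the helper name find_fasta_files is hypothetical, and it assumes that only files directly inside the input directory are considered and that extensions are matched case-insensitively.

```python
from pathlib import Path

# Extensions named in this README; anything else is ignored.
FASTA_SUFFIXES = {".fa", ".fasta"}


def find_fasta_files(input_dir):
    """Return the sorted FASTA files found directly inside input_dir.

    Hypothetical helper for illustration; the pipeline may discover
    its input files differently.
    """
    directory = Path(input_dir)
    if not directory.is_dir():
        raise NotADirectoryError(f"{input_dir} is not a directory")
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES
    )
```

An empty list from this helper means the directory does not meet the "at least one FASTA-formatted file" requirement.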

For smaller tasks, many_multiple_align.py can be used in place of many_multiple_align.nf to avoid unnecessarily overloading QUEST with job submissions. Unlike many_multiple_align.nf, many_multiple_align.py can be run locally on a macOS system.

python3 many_multiple_align.py <input_directory_path>              # Linux OS
python3 many_multiple_align.py <input_directory_path> macintel     # Mac OS Intel
python3 many_multiple_align.py <input_directory_path> macarm       # Mac OS ARM (M1+)

Output

Both many_multiple_align.nf and many_multiple_align.py write their results to a directory named mmalign_MMDDYY, where MMDDYY is replaced by the date. The structure of the output directory is given below.

mmalign_MMDDYY
    ├── alignment           # contains multiple alignments in Clustal (.aln) format
    ├── alignment_efa       # contains multiple alignments in FASTA (.efa) format
    ├── consensus           # contains consensus sequences for each alignment
    └── all_consensus.fa    # single FASTA file containing all consensus sequences
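
To locate the output directory from a script, the name can be rebuilt from a date. This is a small sketch that assumes MMDDYY means zero-padded month, day, and two-digit year, and that the date in question is the day the run was started; the function name output_dir_name is hypothetical.

```python
from datetime import date


def output_dir_name(run_date=None):
    """Build a directory name of the form mmalign_MMDDYY.

    Assumption: MMDDYY is the zero-padded month, day, and two-digit year.
    Defaults to today's date when no date is given.
    """
    run_date = run_date or date.today()
    return "mmalign_" + run_date.strftime("%m%d%y")


print(output_dir_name(date(2024, 3, 7)))  # prints mmalign_030724
```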

Other options

By default, consensus sequences are generated by removing insertions present in <= 20% of aligned sequences. If all insertions should be included in the consensus sequences, add the dumb option as shown in the commands that follow; the consensus sequences are then generated with the Biopython dumb_consensus method instead.

nextflow many_multiple_align.nf --in <input_directory_path> --cons dumb
python3 many_multiple_align.py <input_directory_path> dumb                  # Linux OS
python3 many_multiple_align.py <input_directory_path> macintel dumb         # Mac OS Intel
python3 many_multiple_align.py <input_directory_path> macarm dumb           # Mac OS ARM (M1+)
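
The default behavior can be pictured with a short sketch. This is one interpretation of the rule stated above, not the pipeline's actual code: it assumes gaps are written as "-", that an alignment column is dropped when <= 20% of the sequences carry a residue there, and that each remaining column contributes its most common residue. The function name threshold_consensus and its parameters are hypothetical.

```python
from collections import Counter


def threshold_consensus(aligned, min_fraction=0.2, gap="-"):
    """Consensus that skips columns occupied in <= min_fraction of sequences.

    Illustrative only; the per-column rule (most common residue) is an
    assumption, not taken from the pipeline.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    consensus = []
    for column in zip(*aligned):
        residues = [char for char in column if char != gap]
        # Drop rare insertions: columns where few sequences have a residue.
        if len(residues) / len(aligned) <= min_fraction:
            continue
        consensus.append(Counter(residues).most_common(1)[0][0])
    return "".join(consensus)


alignment = ["AC-GT", "AC-GT", "AC-GT", "AC-GT", "ACTGT"]
print(threshold_consensus(alignment))  # prints ACGT
```

In the example, the third column holds a residue in only one of five sequences (20%), so it is left out of the consensus.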

The tool clean_consensus.py will produce a new FASTA file, cleaned_all_consensus.fa, which contains all consensus sequences that are composed of < 30% ambiguous nucleotides.

python3 clean_consensus.py mmalign_MMDDYY/all_consensus.fa
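
The filtering criterion can be sketched as follows. This is an illustration under stated assumptions, not the contents of clean_consensus.py: it treats every character other than A, C, G, or T as ambiguous, compares case-insensitively, and takes sequences as a plain dictionary of name to sequence. The names ambiguous_fraction and filter_consensus are hypothetical.

```python
def ambiguous_fraction(sequence, unambiguous="ACGT"):
    """Fraction of characters in sequence that are not in unambiguous.

    Assumption: anything outside A/C/G/T counts as ambiguous.
    An empty sequence is treated as fully ambiguous.
    """
    if not sequence:
        return 1.0
    allowed = set(unambiguous)
    ambiguous = sum(1 for base in sequence.upper() if base not in allowed)
    return ambiguous / len(sequence)


def filter_consensus(records, max_fraction=0.3):
    """Keep only sequences that are < max_fraction ambiguous."""
    return {
        name: seq
        for name, seq in records.items()
        if ambiguous_fraction(seq) < max_fraction
    }


records = {"a": "ACGTACGTAC", "b": "ACGNNNACGT", "c": "ACGTNNACGT"}
print(sorted(filter_consensus(records)))  # prints ['a', 'c']
```

Sequence "b" is exactly 30% ambiguous, so it fails the strict < 30% test and is dropped.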

Citation

Cite this pipeline: DOI

Muscle5: Edgar, RC (2021), MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping, bioRxiv 2021.06.20.449169. https://doi.org/10.1101/2021.06.20.449169.

About

Rapid multiple sequence alignment and consensus sequence generation
