ACMGA

AnchorWave-Cactus Multiple Genome Alignment (ACMGA) is a reference-free multiple-genome alignment pipeline. It leverages the power of AnchorWave, a pairwise genome alignment tool that utilizes collinearity and global alignment algorithms. This enables ACMGA to effectively align repetitive sequence regions and accurately identify long INDELs (>50bp). Moreover, ACMGA incorporates the Progressive Cactus algorithm to generate ancestor sequences and implement progressive strategies. This combination of techniques makes ACMGA particularly well-suited for aligning plant genomes that are enriched with repetitive sequences. The simplified schema of the pipeline is depicted below.

Downloading Code

https://github.com/HFzzzzzzz/ACMGA.git

Note: the sequences in genome.fasta must not start with "N"; otherwise the pipeline will fail with an error.
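The check below is a minimal sketch of how this requirement could be verified before running the pipeline. The function name `records_starting_with_n` and the in-memory example are illustrative only and are not part of ACMGA; the sketch assumes plain (uncompressed) FASTA text.

```python
import io


def records_starting_with_n(handle):
    """Yield the IDs of FASTA records whose sequence begins with N or n."""
    name = None
    checked = False  # becomes True once the first sequence line of a record is seen
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record ID.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            checked = False
        elif name is not None and not checked:
            checked = True
            if line[0] in "Nn":
                yield name


# Small in-memory example; for a real file use: open("genome.fasta")
example = io.StringIO(">chr1\nNNNACGT\nACGT\n>chr2\nACGTNNN\n")
bad = list(records_starting_with_n(example))
print(bad)
```

Any record reported here would need its leading N characters trimmed (with annotation coordinates adjusted accordingly) before it is used as input.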

Building Environment

ACMGA supports building the environment using the Docker image or locally.

Using the Docker image:

The parameter model in config.yaml needs to be set to docker mode (default).

ACMGA currently relies on Snakemake (>6.0.0), Docker, and Singularity. Please make sure these dependencies are installed before running ACMGA. We recommend using this approach.

1. Create a conda environment named "acmga" with Python 3.10 and Snakemake (>6.0)
```
 conda install -n base -c conda-forge mamba
 conda activate base
 mamba create -c conda-forge -c bioconda -n acmga python=3.10 snakemake
```
2. Install Docker and Singularity following the documentation instructions
- Singularity installation guide
- Docker installation guide

Building the local environment:

The parameter model in config.yaml needs to be set to local mode.
- Python 3.10
- Biopython
- Snakemake (>6.0)
- AnchorWave (v1.2.3 or later)
- Cactus (v2.7.0)
- SAMtools
- Minimap2
- bedtools
- bedToGenePred
- genePredToGtf
- GffRead
- k8
- maf-convert

With this approach, you need to add the directories containing these executables to your PATH.
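A quick way to confirm that the tools are reachable is to look each one up on the PATH. The sketch below uses only the Python standard library; the executable names in `REQUIRED_TOOLS` are assumptions based on the list above and may differ on your system (for example in capitalization), so adjust them as needed.

```python
import shutil

# Executable names are assumptions derived from the dependency list;
# adjust them to match how the tools are installed locally.
REQUIRED_TOOLS = [
    "snakemake", "anchorwave", "cactus", "samtools", "minimap2",
    "bedtools", "bedToGenePred", "genePredToGtf", "gffread",
    "k8", "maf-convert",
]


def missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed tools were found on PATH.")
```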

Testing the pipeline

Building Environment using Docker

1. Activate the environment and install biopython

 conda activate acmga
 conda install -c conda-forge biopython

2. Generate `command.sh` (bash script for the entire process)

 cd ACMGA
 snakemake  -j 5 --configfile config/config.yaml   --use-singularity  --singularity-args "-B  $(pwd)"

3. Run `command.sh`

 sudo docker run -v $(pwd):/data --rm -it mgatools/acmga:1.0
 sh command.sh

Quickstart

For a quickstart with your own data, you can follow the instructions below. We recommend testing the pipeline with our test data first to ensure the pipeline will work correctly.

Once the test run has completed, the environment has been built successfully. You now only need to prepare your own data and edit the configuration file myconfig.yaml to run multiple genome alignment on your own data.

You can now prepare the run with the pipeline by doing the following:

1. Place your FASTA sequences, GFF files (suffixed with .gff3), and a guide tree in ACMGA/data/.

2. Place the CDS sequence set from all the input genomes in ACMGA/data/.

3. Prepare the configuration file:

3.1 Copy ACMGA/config/config.yaml to ACMGA/config/myconfig.yaml

3.2 Edit ACMGA/config/myconfig.yaml to include:
- Input FASTA sequence file names (parameter fasta: ).
- Input GFF file names and ancestral GFF file names (parameter gff3: ).
- Path to the combined CDS file (parameter nonDuplicateCDS: ); use the CombineCDS.py script to merge the CDS files and obtain non_duplicate_CDS.fa.
- Path of the FASTA and GFF files (parameter path: ).
- Species names (parameter species: ).
- The names of the ancestor sequences (parameter ancestor: ).
- Path of the guide tree (parameter Tree: ), generated using the steps recommended under "Generate a guide tree".
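The values for ancestor: have to agree with the internal node labels of the guide tree. As a minimal sketch, the labels can be read from a Newick string with a regular expression. The helper name `ancestor_names` is hypothetical, and the sketch assumes that every internal node carries a label (as in the `SpeciesTree_rooted_node_labels.txt` file produced by OrthoFinder) rather than a support value.

```python
import re


def ancestor_names(newick):
    """Return internal node labels, i.e. the names that directly follow ')'."""
    return re.findall(r"\)([^():,;\s]+)", newick)


tree = (
    "(Ler:0.00493038,(Cvi:0.0145906,(Arabidopsis_thaliana:0.0117518,"
    "An-1:0.00833626)N2:0.0070939)N1:0.00493038)N0;"
)
ancestors = ancestor_names(tree)
print(ancestors)  # ['N2', 'N1', 'N0']
```

A dedicated tree parser would be more robust for trees that mix labels and support values; the regular expression is only meant for a quick check of the labels.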

The pipeline can then be executed from the ACMGA/ directory in two steps.

1. The first step generates the command.sh script in ACMGA/

cd ACMGA
snakemake  -j 5 --configfile config/myconfig.yaml   --use-singularity  --singularity-args "-B $(pwd)"

2. The second step is to enter the Docker environment and run command.sh

docker login
docker run -v $(pwd):/data --rm -it mgatools/acmga:1.0
sh command.sh

Please note

- The GFF files of the input genomes and of the generated ancestral genomes (parameter gff:) must have the suffix .gff3. There is no requirement for the suffix of the DNA sequence files (.fasta or .fa).
- The species name (parameter species:) should be a substring of the FASTA and GFF file names.
- The names of the ancestor genomes come from the tree structure. For example, in the tree (Ler:0.00493038,(Cvi:0.0145906,(Arabidopsis_thaliana:0.0117518,An-1:0.00833626)N2:0.0070939)N1:0.00493038)N0; the ancestor genomes are named N2, N1, and N0.
- The number of threads for AnchorWave and Cactus (parameters proaliParamters, genoaliParamters, cactus_threads) should be set according to your available memory to avoid running out of memory.
- The chromosome names in the FASTA and GFF files of each species need to be consistent and must not contain special characters such as spaces. The recommended form is chr{chromosome number}.
- Sequences that are not at the chromosome level, such as scaffolds (scaf), can be removed.
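The chromosome-name requirements can be checked before starting a run. The following is a minimal sketch, not part of ACMGA: it compares the sequence IDs of a FASTA file (first word of each header) with the first column of a GFF3 file and flags names that are not purely alphanumeric. The function names and the in-memory example data are hypothetical.

```python
import io
import re

SAFE_NAME = re.compile(r"^[A-Za-z0-9]+$")  # alphanumeric names only, e.g. chr1


def fasta_names(handle):
    """Collect sequence IDs (first word after '>') from a FASTA file."""
    names = set()
    for line in handle:
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names


def gff_names(handle):
    """Collect sequence IDs (column 1) from a GFF3 file, skipping comments."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t")[0])
    return names


def compare(fasta_handle, gff_handle):
    fa, gff = fasta_names(fasta_handle), gff_names(gff_handle)
    return {
        "only_in_fasta": sorted(fa - gff),
        "only_in_gff": sorted(gff - fa),
        "unsafe": sorted(n for n in fa | gff if not SAFE_NAME.match(n)),
    }


fasta = io.StringIO(">chr1\nACGT\n>chr2\nACGT\n>scaf_1\nAC\n")
gff = io.StringIO(
    "##gff-version 3\n"
    "chr1\t.\tgene\t1\t4\t.\t+\t.\tID=g1\n"
    "Chr2\t.\tgene\t1\t4\t.\t+\t.\tID=g2\n"
)
report = compare(fasta, gff)
print(report)
```

In this example the report shows that chr2 and Chr2 differ only in case and that scaf_1 contains an underscore; both would need to be fixed or removed before running the pipeline.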

Generate a guide tree

1. Use the GEAN tool to generate protein sequences by inputting FASTA, GFF files, and corresponding CDS and gene sequence files

./gean gff2seq -i /media/zhf/ext1/Downloads/gff/gff_chr1-5/Cvi.protein-coding.genes.v2.5.2019-10-09.gff3 -r /media/zhf/ext1/Downloads/fasta/fasta_chr1-5/Cvi.chr.all.v2.0.fasta -p /media/zhf/ext1/Downloads/protein/Cvi.protein.fa -c /media/zhf/ext1/Downloads/protein/Cvi.cds.fa -g /media/zhf/ext1/Downloads/protein/Cvi.gene.fa

./gean gff2seq -i /media/zhf/ext1/Downloads/gff/gff_chr1-5/An-1.protein-coding.genes.v2.5.2019-10-09.gff3 -r /media/zhf/ext1/Downloads/fasta/fasta_chr1-5/An-1.chr.all.v2.0.fasta -p /media/zhf/ext1/Downloads/protein/An-1.protein.fa -c /media/zhf/ext1/Downloads/protein/An-1.cds.fa -g /media/zhf/ext1/Downloads/protein/An-1.gene.fa

./gean gff2seq -i /media/zhf/ext1/Downloads/gff/gff_chr1-5/Ler.protein-coding.genes.v2.5.2019-10-09.gff3 -r /media/zhf/ext1/Downloads/fasta/fasta_chr1-5/Ler.chr.all.v2.0.fasta -p /media/zhf/ext1/Downloads/protein/Ler.protein.fa -c /media/zhf/ext1/Downloads/protein/Ler.cds.fa -g /media/zhf/ext1/Downloads/protein/Ler.gene.fa

./gean gff2seq -i /media/zhf/ext1/Downloads/gff/gff_chr1-5/Arabidopsis_thaliana.TAIR10.56.gff3 -r /media/zhf/ext1/Downloads/fasta/fasta_chr1-5/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -p /media/zhf/ext1/Downloads/protein/Arabidopsis_thaliana.protein.fa -c /media/zhf/ext1/Downloads/protein/Arabidopsis_thaliana.cds.fa -g /media/zhf/ext1/Downloads/protein/Arabidopsis_thaliana.gene.fa
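The four commands above differ only in the species name and the input paths, so they can be assembled in a loop. The sketch below only builds and prints the command lines using the flags shown above (`-i`, `-r`, `-p`, `-c`, `-g`); the helper name `gean_gff2seq_command` and the relative directory names are hypothetical and should be replaced with your own paths.

```python
import shlex


def gean_gff2seq_command(gean, gff, fasta, out_dir, species):
    """Build the argument list for one 'gean gff2seq' call."""
    return [
        gean, "gff2seq",
        "-i", gff,                                  # input GFF3 annotation
        "-r", fasta,                                # reference genome FASTA
        "-p", f"{out_dir}/{species}.protein.fa",    # protein output
        "-c", f"{out_dir}/{species}.cds.fa",        # CDS output
        "-g", f"{out_dir}/{species}.gene.fa",       # gene output
    ]


# Hypothetical relative paths; file names follow the examples above.
inputs = {
    "Cvi": ("gff/Cvi.protein-coding.genes.v2.5.2019-10-09.gff3",
            "fasta/Cvi.chr.all.v2.0.fasta"),
    "Ler": ("gff/Ler.protein-coding.genes.v2.5.2019-10-09.gff3",
            "fasta/Ler.chr.all.v2.0.fasta"),
}

commands = [
    gean_gff2seq_command("./gean", gff, fasta, "protein", species)
    for species, (gff, fasta) in inputs.items()
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` to execute the command instead of printing it.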

2. Use the OrthoFinder tool to generate a guide tree

Create the ExampleData folder in the OrthoFinder directory. Put the generated protein sequence into the ExampleData folder. Use the following command to generate a guide tree.

OrthoFinder/orthofinder -f OrthoFinder/ExampleData

OrthoFinder/ExampleData/OrthoFinder/Results_xxx/Species_Tree/SpeciesTree_rooted_node_labels.txt is the generated guide tree file.

Common errors

When the Snakemake run terminates with an error even though Snakemake (version > 6.0.0) is correctly installed, the cause is often related to the input files:

- Input FASTA and GFF files in the data/ directory do not match the samples listed in the species parameter of the config file.
- Input FASTA and GFF files contain chromosomes/scaffolds whose names include special characters; ideally, use names consisting of alphanumeric characters only, such as chr1.
- The ancestor parameter in config.yaml is not set correctly. Set the ancestor nodes according to your tree file; they include all non-leaf nodes in the tree.
- If the test case fails, check for incomplete data downloads caused by network problems.
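The first cause can be checked with a few lines of Python. The sketch below is an illustration rather than part of ACMGA: it treats each species name as a substring of the file names and reports species that do not have exactly one FASTA file (.fa or .fasta) and one .gff3 file. The function names are hypothetical, and the file list is given inline instead of being read from data/.

```python
def match_files(species, filenames):
    """Map each species name to the FASTA and GFF3 file names containing it."""
    matches = {}
    for name in species:
        hits = [f for f in filenames if name in f]
        matches[name] = {
            "fasta": [f for f in hits if f.endswith((".fa", ".fasta"))],
            "gff3": [f for f in hits if f.endswith(".gff3")],
        }
    return matches


def report_problems(matches):
    """List species that do not have exactly one file of each kind."""
    problems = []
    for name, kinds in matches.items():
        for kind, files in kinds.items():
            if len(files) != 1:
                problems.append(
                    f"{name}: expected 1 {kind} file, found {len(files)}"
                )
    return problems


# Inline example; for a real run use os.listdir("data") instead.
files = [
    "Cvi.chr.all.v2.0.fasta",
    "Cvi.protein-coding.genes.v2.5.2019-10-09.gff3",
    "Ler.chr.all.v2.0.fasta",
]
problems = report_problems(match_files(["Cvi", "Ler"], files))
for problem in problems:
    print(problem)
```

Note that other FASTA files in the same directory whose names contain a species name (for example a per-species CDS file) would also be counted, so the check is only a rough first pass.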

Script description

The ACMGA/workflow/scripts/ directory contains three scripts:

- ACMGAPipeline.py generates command.sh.
- CombineCDS.py merges the CDS files.
- replace_ref_que.py converts PAF files into the format required by cactus_consolidated.

How to cite

If you use ACMGA in your work, please cite:
