rAMPage: Rapid AMP Annotation and Gene Estimation

Description

rAMPage is an in silico anti-microbial peptide (AMP) discovery pipeline that takes in bulk RNA-seq reads and outputs a FASTA file of annotated, confident, short, and charged putative AMPs.

Quick Links

Overview | Poster
- Setup
- Dependencies
  1. Basics
  2. Tools
  3. Optional
- Input
- Usage
- Directory Structure
- Citation

Setup

Clone this repository:

git clone https://github.com/bcgsc/rAMPage.git

Download and install the dependencies (listed in the Dependencies section below) into rAMPage/src.
- some of these dependencies need to be configured: SignalP, ProP, SABLE, EnTAP (see configurations)
- install AMPlify using conda (required, since biopython and pandas are also dependencies of scripts other than AMPlify)
```
 cd rAMPage
 conda create --prefix src/AMPlify python=3.6
 conda activate ./src/AMPlify
 conda install -c bioconda amplify
```
Update all the paths in rAMPage/scripts/config.sh to reflect dependencies in rAMPage/src and dependencies pre-installed elsewhere.
Source scripts/config.sh in the root of the repository.
```
source scripts/config.sh
```
Create working directories for each dataset using this convention: taxonomic-class/species/tissue-or-condition
- NOTE: the top-level parent directory must correspond to the taxonomic class of the dataset. This class is used to choose which file in amp_seqs to use for homology search.
- e.g. M. gulosa: insecta/mgulosa/venom
- e.g. P. toftae: amphibia/ptoftae/skin-liver
Move all reads and reference FASTA files to the respective working directories for each dataset. See below for an example.
Create a 2- or 3-column space-delimited text file called input.txt, as specified in the Input section below, in the working directory of each dataset.

At the end of setup, you should have a directory structure similar to the one below (other directories, such as scripts/, are omitted):

rAMPage
├── amphibia
│   └── ptoftae
│       └── skin-liver
│           ├── input.txt
│           └── raw_reads
│               ├── SRR8288040_1.fastq.gz
│               ├── SRR8288040_2.fastq.gz
│               ├── SRR8288041_1.fastq.gz
│               ├── SRR8288041_2.fastq.gz
│               ├── SRR8288056_1.fastq.gz
│               ├── SRR8288056_2.fastq.gz
│               ├── SRR8288057_1.fastq.gz
│               ├── SRR8288057_2.fastq.gz
│               ├── SRR8288058_1.fastq.gz
│               ├── SRR8288058_2.fastq.gz
│               ├── SRR8288059_1.fastq.gz
│               ├── SRR8288059_2.fastq.gz
│               ├── SRR8288060_1.fastq.gz
│               ├── SRR8288060_2.fastq.gz
│               ├── SRR8288061_1.fastq.gz
│               └── SRR8288061_2.fastq.gz
└── insecta
    └── mgulosa
        └── venom
            ├── input.txt
            ├── raw_reads
            │   ├── SRR6466797_1.fastq.gz
            │   └── SRR6466797_2.fastq.gz
            └── tsa.GGFG.1.fsa_nt.gz
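
The directory convention above (taxonomic-class/species/tissue-or-condition) can be checked before running the pipeline. The following is a minimal sketch, not part of rAMPage: the function name `split_workdir` is hypothetical, and it assumes the working directory is given relative to the root of the repository.

```shell
# Hypothetical helper (not part of rAMPage): split a working directory,
# given relative to the repository root, into class, species and tissue.
split_workdir() {
    workdir=${1%/}                       # drop a trailing slash, if any
    case $workdir in
        */*/*/*) ;;                      # too many levels: fall through to the error
        */*/*)
            class=${workdir%%/*}         # text before the first slash
            rest=${workdir#*/}
            species=${rest%%/*}          # text between the first and second slash
            tissue=${rest#*/}            # text after the second slash
            if [ -n "$class" ] && [ -n "$species" ] && [ -n "$tissue" ]; then
                printf 'class=%s species=%s tissue=%s\n' "$class" "$species" "$tissue"
                return 0
            fi ;;
    esac
    echo "expected taxonomic-class/species/tissue-or-condition: $1" >&2
    return 1
}

split_workdir amphibia/ptoftae/skin-liver
split_workdir insecta/mgulosa/venom
```

A path with fewer or more than three levels makes the function return a non-zero status, which is a cheap way to catch a dataset placed at the wrong depth.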

Dependencies

Basics

Dependency	Tested Version
GNU `bash`	v5.0.11(1)
GNU `awk`	v5.0.1
GNU `sed`	v4.8
GNU `grep`	v3.4
GNU `make`	v4.3
`column` (util-linux)	v2.36
Miller `mlr`	v5.4.0
`bc`	v1.07.1
`gzip`	v1.10
`python`	v3.7.7
`Rscript`*	v4.0.2

*requires tidyverse v1.3.0, glue v1.4.2, and docopt v0.7.1.
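
Before installing the larger tools, it may help to confirm that the basic commands are reachable. The sketch below only checks that each command name from the table above is on `PATH`; it does not check versions, and the function name `missing_tools` is hypothetical.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Command names taken from the Basics table above.
missing=$(missing_tools bash awk sed grep make column mlr bc gzip python Rscript)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing
else
    echo "all basic dependencies found"
fi
```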

Tools

Dependency	Tested Version
SRA toolkit	v2.10.5
EDirect	v13.8
fastp	v0.20.0
RNA-Bloom	v1.3.1
salmon	v1.3.0
TransDecoder	v5.5.0
HMMER	v3.3.1
cd-hit	v4.8.1
seqtk	v1.1-r91
SignalP	v3.0
ProP	v1.0c
AMPlify	v1.1.0
EnTAP	v0.10.7-beta
Exonerate	v2.4.0
SABLE	v4.0
Clustal Omega	v1.2.4

Configurations

Configuring SignalP

To download SignalP, you must enter your email address and institution. Afterwards, a download link valid for 4 hours will be emailed to you. Clicking on the link will show you one link for each system (e.g. Linux). Click the link to download, or right click to copy the link and download on the command line using curl or wget.

After moving the downloaded signalp-3.0.Linux.tar.Z file to src, decompress it:

cd src/
cat signalp-3.0.Linux.tar.Z | uncompress | tar xvf -

The file to edit is src/signalp-3.0/signalp:

Before	After
`SIGNALP=/usr/opt/signalp-3.0`	`SIGNALP=$ROOT_DIR/src/signalp-3.0`
`AWK=nawk`	`AWK=awk`

Note: More changes may need to be made according to what executables are accessible in your PATH variable and on your system. For FULL installation instructions, please read src/signalp-3.0/signalp-3.0.readme in detail.

The experimental scripts/helpers/install_prop.sh can be used to install SignalP with the changes listed above, but more changes may be required. Make sure that SignalP works with the test datasets in its directory before running rAMPage, e.g.

cd src/signalp-3.0
./signalp -t euk test/test.seq

Configuring ProP

To download ProP, you must enter your email address and institution. Afterwards, a download link valid for 4 hours will be emailed to you. Clicking on the link will show one link for each system (e.g. Linux). Click the link to download, or right click to copy the link and download on the command line using curl or wget.

After moving the downloaded prop-1.0c.Linux.tar.Z file to src, decompress it:

cd src/
cat prop-1.0c.Linux.tar.Z | uncompress | tar xvf -

The file to edit is src/prop-1.0c/prop:

Before	After
`setenv PROPHOME /usr/cbs/packages/prop/1.0c/prop-1.0c`	`setenv PROPHOME $ROOT_DIR/src/prop-1.0c`
*`setenv SIGNALP /usr/cbs/bio/bin/signalp`	`setenv SIGNALP $ROOT_DIR/src/signalp-3.0/signalp`

*Edit the line corresponding to your system; Linux is used in this example.

Note: More changes may need to be made according to what executables are accessible in your PATH variable and on your system. For FULL installation instructions, please read src/prop-1.0c/prop-1.0c.readme in detail.

The experimental scripts/helpers/install_prop.sh can be used to install ProP with the changes listed above, but more changes may be required. Make sure that ProP works with the test datasets in its directory before running rAMPage, e.g.

cd src/prop-1.0c
./prop -s test/EDA_HUMAN.fsa

Configuring EnTAP

Download and decompress the following databases:

Database	Example Download Code
RefSeq: Non-mammalian Vertebrates (for `amphibia`)	`wget -O vertebrate_other_protein.faa.gz ftp://ftp.ncbi.nlm.nih.gov/refseq/release/vertebrate_other/vertebrate_other.*.protein.faa.gz`
RefSeq: Invertebrates (for `insecta`)	`wget -O invertebrate_protein.faa.gz ftp://ftp.ncbi.nlm.nih.gov/refseq/release/invertebrate/invertebrate.*.protein.faa.gz`
SwissProt	`wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz`
NCBI `nr`	`wget -O nr.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz`

After decompression, the databases can be configured using scripts/config-entap.sh:

scripts/config-entap.sh -t 8 invertebrate_protein.faa vertebrate_other_protein.faa uniprot_sprot.fasta nr.fasta

The script configures all the databases in the EnTAP-0.10.7-beta/bin directory.

Configuring SABLE

The file to edit is src/sable_v4_distr/run.sable:

Before	After
`remDir=$PWD;`	`remDir=$PWD; THREADS=$1;`
`export SABLE_DIR="/users/radamcza/work/newSable/sable_distr";`	`export SABLE_DIR="$ROOT_DIR/src/sable_v4_distr";`
`export BLAST_DIR="/usr/local/blast/2.2.28/bin";`	`export BLAST_DIR=$BLAST_DIR`
*`export NR_DIR="/database/ncbi/nr"`	`export NR_DIR=$ROOT_DIR/src/EnTAP-0.10.7-beta/bin/nr`
`export PRIMARY_DATABASE="/users/radamcza/work/newSable/sable_distr/GI_indexes/pfam_index"`	`export PRIMARY_DATABASE="$ROOT_DIR/src/sable_v4_distr/GI_indexes/pfam_index"`
`export SECONDARY_DATABASE="/users/radamcza/work/newSable/sable_distr/GI_indexes/swissprot_index"`	`export SECONDARY_DATABASE="$ROOT_DIR/src/sable_v4_distr/GI_indexes/swissprot_index";`
`mkdir $PBS_JOBID`	`mkdir -p $PBS_JOBID`
`/usr/bin/perl ${SABLE_DIR}/sable.pl`	`perl ${SABLE_DIR}/sable.pl $THREADS`

*After downloading the nr FASTA file (see Configuring EnTAP above), it will need to be configured using BLAST+'s makeblastdb.

Optional

Dependency	Tested Version
GNU `wget`	v1.20.3
`curl`	v7.72.0
`pigz`	v2.4

Input

A 2- or 3-column space-delimited text file named input.txt, located in the working directory of each dataset.

Column	Attribute
1	Pooling ID: generally a condition, tissue, or sex, etc.
2	Path to read 1
3	Path to read 2 (if paired-end reads)

Read paths in this input text file should be relative to the location of the input text file.
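
The format rules above (2 or 3 columns, read paths relative to input.txt) can be verified with a short script. This is a sketch under stated assumptions: `check_input` is a hypothetical helper, not a rAMPage script, and it splits columns on any whitespace rather than strictly on single spaces.

```shell
# Hypothetical helper: check that every line of an input.txt has 2 or 3
# columns and that each read path exists relative to the file's directory.
check_input() {
    input=$1
    dir=$(dirname "$input")
    status=0
    while read -r pool read1 read2 extra || [ -n "$pool" ]; do
        if [ -z "$pool" ]; then continue; fi          # skip blank lines
        if [ -z "$read1" ] || [ -n "$extra" ]; then
            echo "expected 2 or 3 columns: $pool $read1 $read2 $extra" >&2
            status=1
            continue
        fi
        for reads in "$read1" "$read2"; do
            if [ -z "$reads" ]; then continue; fi     # single-end: no read 2
            if [ ! -e "$dir/$reads" ]; then
                echo "missing read file: $dir/$reads" >&2
                status=1
            fi
        done
    done < "$input"
    return "$status"
}

# Demonstration on a throwaway directory with empty placeholder read files.
demo=$(mktemp -d)
mkdir -p "$demo/raw_reads"
: > "$demo/raw_reads/SRR6466797_1.fastq.gz"
: > "$demo/raw_reads/SRR6466797_2.fastq.gz"
echo "venom raw_reads/SRR6466797_1.fastq.gz raw_reads/SRR6466797_2.fastq.gz" > "$demo/input.txt"
check_input "$demo/input.txt" && echo "input.txt looks consistent"
rm -r "$demo"
```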

Need help downloading reads? The scripts/helpers/get-reads.sh script can be used to download reads. These dependencies are required:

Dependency	Tested Version
SRA toolkit	v2.10.5
EDirect	v13.8

The input runs.txt should have one SRR accession on each line.

Example: M. gulosa

POOLING ID	READ 1	READ 2
venom	raw_reads/SRR6466797_1.fastq.gz	raw_reads/SRR6466797_2.fastq.gz

insecta/mgulosa/venom/input.txt:

venom raw_reads/SRR6466797_1.fastq.gz raw_reads/SRR6466797_2.fastq.gz

Using scripts/helpers/get-reads.sh:

scripts/helpers/get-reads.sh -o insecta/mgulosa/venom/raw_reads -p insecta/mgulosa/venom/runs.txt

insecta/mgulosa/venom/runs.txt:

SRR6466797
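
For a dataset with a single pooling ID, the input.txt lines can be derived from runs.txt. The sketch below assumes paired-end reads named `<accession>_1.fastq.gz` and `<accession>_2.fastq.gz` inside `raw_reads/`, as in the example above; `runs_to_input` is a hypothetical name, and datasets with several pooling IDs (such as P. toftae) still need their IDs assigned by hand.

```shell
# Hypothetical helper: print one input.txt line per accession in a runs file.
#   $1 = pooling ID, $2 = runs file (one SRR accession per line)
# Assumes paired-end files named <accession>_1.fastq.gz / <accession>_2.fastq.gz.
runs_to_input() {
    awk -v pool="$1" 'NF {
        printf "%s raw_reads/%s_1.fastq.gz raw_reads/%s_2.fastq.gz\n", pool, $1, $1
    }' "$2"
}

# Demonstration with a temporary runs file.
runs=$(mktemp)
echo SRR6466797 > "$runs"
runs_to_input venom "$runs"
rm "$runs"
```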

Example: P. toftae

POOLING ID	READ 1	READ 2
liver	raw_reads/SRR8288040_1.fastq.gz	raw_reads/SRR8288040_2.fastq.gz
skin	raw_reads/SRR8288041_1.fastq.gz	raw_reads/SRR8288041_2.fastq.gz
liver	raw_reads/SRR8288056_1.fastq.gz	raw_reads/SRR8288056_2.fastq.gz
skin	raw_reads/SRR8288057_1.fastq.gz	raw_reads/SRR8288057_2.fastq.gz
liver	raw_reads/SRR8288058_1.fastq.gz	raw_reads/SRR8288058_2.fastq.gz
skin	raw_reads/SRR8288059_1.fastq.gz	raw_reads/SRR8288059_2.fastq.gz
liver	raw_reads/SRR8288060_1.fastq.gz	raw_reads/SRR8288060_2.fastq.gz
skin	raw_reads/SRR8288061_1.fastq.gz	raw_reads/SRR8288061_2.fastq.gz

amphibia/ptoftae/skin-liver/input.txt:

liver raw_reads/SRR8288040_1.fastq.gz raw_reads/SRR8288040_2.fastq.gz
skin raw_reads/SRR8288041_1.fastq.gz raw_reads/SRR8288041_2.fastq.gz
liver raw_reads/SRR8288056_1.fastq.gz raw_reads/SRR8288056_2.fastq.gz
skin raw_reads/SRR8288057_1.fastq.gz raw_reads/SRR8288057_2.fastq.gz
liver raw_reads/SRR8288058_1.fastq.gz raw_reads/SRR8288058_2.fastq.gz
skin raw_reads/SRR8288059_1.fastq.gz raw_reads/SRR8288059_2.fastq.gz
liver raw_reads/SRR8288060_1.fastq.gz raw_reads/SRR8288060_2.fastq.gz
skin raw_reads/SRR8288061_1.fastq.gz raw_reads/SRR8288061_2.fastq.gz

Using scripts/helpers/get-reads.sh:

scripts/helpers/get-reads.sh -o amphibia/ptoftae/skin-liver/raw_reads -p amphibia/ptoftae/skin-liver/runs.txt

amphibia/ptoftae/skin-liver/runs.txt:

SRR8288040
SRR8288041
SRR8288056
SRR8288057
SRR8288058
SRR8288059
SRR8288060
SRR8288061

Reference Transcriptomes

To use a reference transcriptome for the assembly stage with RNA-Bloom, put the reference in the working directory or use the -r option of scripts/rAMPage.sh.

insecta/mgulosa/venom
├── input.txt
├── raw_reads
│   ├── SRR6466797_1.fastq.gz
│   └── SRR6466797_2.fastq.gz
└── tsa.GGFG.1.fsa_nt.gz

In this case, the reference transcriptome is a Transcriptome Shotgun Assembly for M. gulosa, downloaded from ftp://ftp.ncbi.nlm.nih.gov/genbank/tsa/G/tsa.GGFG.1.fsa_nt.gz.

Multiple references can be used as long as they are placed in the working directory.

Sources of References

Representative Genomes can be found by searching the Genome database on NCBI, using these search terms (A. mellifera, for example):

"Apis mellifera"[orgn]

Transcriptome Shotgun Assemblies can be found by searching the Nucleotide database on NCBI, using these search terms:

tsa-master[prop] "Apis mellifera"[orgn] midgut[All Fields]
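
The search term above follows a fixed pattern, so it can be assembled for other species. This is only a sketch: `tsa_query` is a hypothetical helper, and the EDirect call in the comment is an illustration of where the string could be used, not a step of rAMPage.

```shell
# Hypothetical helper: build a Transcriptome Shotgun Assembly search term.
#   $1 = species name, remaining arguments = optional free-text terms
tsa_query() {
    species=$1
    shift
    query="tsa-master[prop] \"${species}\"[orgn]"
    for term in "$@"; do
        query="${query} ${term}[All Fields]"
    done
    printf '%s\n' "$query"
}

tsa_query "Apis mellifera" midgut
# With EDirect installed, the term could be passed to esearch, for example:
#   esearch -db nuccore -query "$(tsa_query "Apis mellifera" midgut)"
```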

Usage

The rAMPage.sh script in scripts/ runs the pipeline using a Makefile.

PROGRAM: rAMPage.sh

DESCRIPTION:
      Runs the rAMPage pipeline, using the Makefile.
      
USAGE(S):
      rAMPage.sh [-a <address>] [-b] [-c <taxonomic class>] [-d] [-f] [-h] [-m] [-n <species name>] [-o <output directory>] [-p] [-r <FASTA.gz>] [-s] [-t <int>] [-v] <input reads TXT file>
      
OPTIONS:
       -a <address>    email address for alerts                               
       -c <class>      taxonomic class of the dataset                         (default = top-level directory in $outdir)
       -d              debug mode of Makefile                                 
       -f              force characterization even if no AMPs found           
       -h              show help menu                                         
       -m <target>     Makefile target                                        (default = exonerate)
       -n <species>    taxonomic species or name of the dataset               (default = second-level directory in $outdir)
       -o <directory>  output directory                                       (default = directory of input reads TXT file)
       -p              run processes in parallel                              
       -r <FASTA.gz>   reference transcriptome                                (accepted multiple times, *.fna.gz *.fsa_nt.gz)
       -s              strand-specific library construction                   (default = false)
       -t <int>        number of threads                                      (default = 48)
       -v              print version number                                   
       -E <e-value>    E-value threshold for homology search                  (default = 1e-5)
       -S <3.0103 to 80>     AMPlify score threshold for amphibian AMPs             (default = 10)
       -L <int>        Length threshold for AMPs                              (default = 30)
       -C <int>        Charge threshold for AMPs                              (default = 2)
       -R              Disable redundancy removal during transcript assembly  
                                                                              
EXAMPLE(S):
      rAMPage.sh -a user@example.com -c class -n species -p -s -t 8 -o /path/to/output/directory -r /path/to/reference.fna.gz -r /path/to/reference.fsa_nt.gz /path/to/input.txt 
      
INPUT EXAMPLE:
       tissue /path/to/readA_1.fastq.gz /path/to/readA_2.fastq.gz
       tissue /path/to/readB_1.fastq.gz /path/to/readB_2.fastq.gz
       
MAKEFILE TARGETS:
       01) check        08) homology
       02) reads        09) cleavage
       03) trim         10) amplify
       04) readslist    11) annotation
       05) assembly     12) exonerate
       06) filtering    13) sable
       07) translation  14) all

Choosing Thresholds

The best way to choose score, length, and charge thresholds is to plot the distributions of the reference AMPs.

scripts/helpers/plot-dist.sh -a amphibianAMPs.faa -i insectAMPs.faa -t 8 -o /path/to/output/dir -r
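
To get a feel for the length and charge values behind such distributions, they can be tabulated directly from a protein FASTA file. The sketch below is independent of plot-dist.sh and rests on two assumptions: net charge is approximated as (K + R) - (D + E), which may differ from the pipeline's own calculation, and the thresholds are read as a maximum length and a minimum charge. Both function names are hypothetical.

```shell
# Print "<id> <length> <net charge>" for every record of a protein FASTA
# read from standard input.
# Assumption: net charge is approximated as (K + R) - (D + E).
fasta_length_charge() {
    awk '
        function report(   s, pos, neg) {
            if (id == "") return
            s = seq; pos = gsub(/[KR]/, "", s)     # count basic residues
            s = seq; neg = gsub(/[DE]/, "", s)     # count acidic residues
            print id, length(seq), pos - neg
        }
        /^>/ { report(); id = substr($1, 2); seq = ""; next }
        { gsub(/[ \t\r]/, ""); seq = seq toupper($0) }
        END { report() }
    '
}

# Keep records no longer than $1 residues with a net charge of at least $2.
# Assumption: this is how the length and charge thresholds are interpreted.
filter_short_charged() {
    awk -v max_len="$1" -v min_charge="$2" '$2 <= max_len && $3 >= min_charge'
}

# Two made-up peptides, filtered with the default thresholds (30 and 2).
printf '>pep1\nGLFKKLLRKA\n>pep2\nDDEEAG\n' | fasta_length_charge | filter_short_charged 30 2
```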

Running from the root of the repository

Example: M. gulosa (stranded library construction)

scripts/rAMPage.sh -v -s -o insecta/mgulosa/venom -r insecta/mgulosa/venom/tsa.GGFG.1.fsa_nt.gz -c insecta -n mgulosa insecta/mgulosa/venom/input.txt

In the example above, the -o insecta/mgulosa/venom argument is optional, since the output directory defaults to the parent directory of the input.txt file. The option is a safeguard for the case where input.txt is not located in the working directory: -o then moves input.txt and any provided references to the working directory.

rAMPage uses all *.fsa_nt* and *.fna* files located in the working directory as references in the assembly stage, whether or not the -r option is used. The -r option is a safeguard for the case where the provided references are not located in the working directory: it moves them there.
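
Because any matching file in the working directory is treated as a reference, it can be worth listing the candidates before a run. This sketch is not how rAMPage itself finds references; `list_references` is a hypothetical helper that looks only at the top level of the directory.

```shell
# List regular files in a directory whose names match *.fsa_nt* or *.fna*.
list_references() {
    for f in "$1"/*.fsa_nt* "$1"/*.fna*; do
        if [ -f "$f" ]; then      # skips patterns that matched nothing
            printf '%s\n' "$f"
        fi
    done
}

# Demonstration on a throwaway directory with empty placeholder files.
demo=$(mktemp -d)
: > "$demo/tsa.GGFG.1.fsa_nt.gz"
: > "$demo/input.txt"
list_references "$demo"
rm -r "$demo"
```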

Running from the working directory of the dataset

Example: M. gulosa (stranded library construction)

$ROOT_DIR/scripts/rAMPage.sh -s -r tsa.GGFG.1.fsa_nt.gz -c insecta -n mgulosa input.txt

Running multiple datasets from the root of the repository

To run rAMPage on multiple datasets, you can use the stAMPede.sh wrapper script. By default, stAMPede.sh runs rAMPage on the datasets consecutively. If the -s option is given, the datasets are run simultaneously. The -p option allows parallelization of certain processes within each dataset, such as trimming reads.

Note: This script is experimental and has fewer options than running rAMPage.sh.

PROGRAM: stAMPede.sh

DESCRIPTION:
      A wrapper around rAMPage.sh to allow running of multiple assemblies.
      
USAGE(S):
      stAMPede.sh [-a <address>] [-d] [-h] [-m] [-p] [-s] [-t <int>] [-v] <accessions TXT file>
      
OPTION(S):
       -a <address>  email address for alerts                                   
       -d            debug mode                                                 
       -h            show help menu                                             
       -m <target>   Makefile target                                            (default = exonerate)
       -p            allow parallel processes for each dataset                  
       -s            simultaneously run rAMPage on all datasets                 (default if SLURM available)
       -t <int>      number of threads                                          (default = 48)
       -v            verbose (uses /usr/bin/time -pv to time each rAMPage run)  
       -E <e-value>  E-value threshold for homology search                      (default = 1e-5)
       -S <3.0103 to 80>   AMPlify score threshold for amphibian AMPs                 (default = 10)
       -L <int>      Length threshold for AMPs                                  (default = 30)
       -C <int>      Charge threshold for AMPs                                  (default = 2)
                                                                                
ACCESSIONS TXT FORMAT:
       CLASS/SPECIES/TISSUE_OR_CONDITION/input.txt strandedness
       amphibia/ptoftae/skin-liver/input.txt nonstranded
       insecta/mgulosa/venom/input.txt stranded
       
EXAMPLE(S):
      stAMPede.sh -a user@example.com -p -s -v accessions.txt

Input

For running multiple datasets, the multi-input text file should be a 2-column text file:

Column	Attribute
1	path to `input.txt` file
2	`stranded` or `nonstranded`

Example: P. toftae and M. gulosa

`input.txt`	`strandedness`
`amphibia/ptoftae/skin-liver/input.txt`	`nonstranded`
`insecta/mgulosa/venom/input.txt`	`stranded`

multi-input.txt:

amphibia/ptoftae/skin-liver/input.txt nonstranded
insecta/mgulosa/venom/input.txt stranded
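
To see what each line of the multi-input file amounts to, the lines can be translated into individual rAMPage.sh invocations. The following dry run only prints commands and executes nothing; it assumes that `stranded` corresponds to the -s option of rAMPage.sh, and `print_rampage_commands` is a hypothetical name, not the logic of stAMPede.sh.

```shell
# Hypothetical dry run: print the rAMPage.sh command implied by each line of
# a multi-input file, without executing anything.
print_rampage_commands() {
    while read -r input strandedness extra || [ -n "$input" ]; do
        if [ -z "$input" ]; then continue; fi        # skip blank lines
        if [ -n "$extra" ]; then
            echo "expected 2 columns: $input $strandedness $extra" >&2
            return 1
        fi
        case $strandedness in
            stranded)    echo "scripts/rAMPage.sh -s $input" ;;
            nonstranded) echo "scripts/rAMPage.sh $input" ;;
            *) echo "second column must be stranded or nonstranded: $input" >&2
               return 1 ;;
        esac
    done < "$1"
}

# Demonstration with the two example datasets.
multi=$(mktemp)
printf '%s\n' \
    "amphibia/ptoftae/skin-liver/input.txt nonstranded" \
    "insecta/mgulosa/venom/input.txt stranded" > "$multi"
print_rampage_commands "$multi"
rm "$multi"
```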

AMPs for Synthesis

For reproducibility, clustering AMPs across datasets and choosing AMPs for synthesis are included in the scripts/stAMPede.sh script, but manual clustering can be done using scripts/cluster.sh:

scripts/cluster.sh -o /path/to/outdir amphibia/ptoftae/skin-liver/exonerate insecta/mgulosa/venom/exonerate

In total, 90 AMPs were selected for synthesis, of which a subset of 21 has been validated in vitro thus far.

Directory Structure

Example directory structure:

rAMPage
├── amphibia
│   └── ptoftae
│       └── skin-liver
├── amp_seqs
├── insecta
│   └── mgulosa
│       └── venom
├── scripts
└── src

Citation

Lin, D.; Sutherland, D.; Aninta, S.I.; Louie, N.; Nip, K.M.; Li, C.; Yanai, A.; Coombe, L.; Warren, R.L.; Helbing, C.C.; et al. Mining Amphibian and Insect Transcriptomes for Antimicrobial Peptide Sequences with rAMPage. Antibiotics 2022, 11, 952. https://doi.org/10.3390/antibiotics11070952
