Skip to content

bcgsc/rAMPage

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

rAMPage: Rapid AMP Annotation and Gene Estimation

Written by Diana Lin.

Description

rAMPage is an in silico anti-microbial peptide (AMP) discovery pipeline that takes in bulk RNA-seq reads and outputs a FASTA file of annotated, confident, short, and charged putative AMPs.

Quick Links

  1. Overview | Poster
  2. Setup
  3. Dependencies
    1. Basics
    2. Tools
    3. Optional
  4. Input
  5. Usage
  6. Directory Structure
  7. Citation

Setup

  1. Clone this repository:
    git clone https://github.com/bcgsc/rAMPage.git
    
  2. Download and install the dependencies (specified in the Dependencies section below), into rAMPage/src.
    • some of these dependencies need to be configured: SignalP, ProP, SABLE, EnTAP (see configurations)
    • install AMPlify using conda (required-- biopython and pandas are dependencies for other scripts other than AMPlify)
       cd rAMPage
       conda create --prefix src/AMPlify python=3.6
       conda activate AMPlify
       conda install -c bioconda amplify
  3. Update all the paths in rAMPage/scripts/config.sh to reflect dependencies in rAMPage/src and dependencies pre-installed elsewhere.
  4. Source scripts/config.sh in the root of the repository.
    source scripts/config.sh
  5. Create working directories for each dataset using this convention: taxonomic-class/species/tissue-or-condition
    • NOTE: the top-level parent directory must correspond to the taxonomic class of the dataset. This class is used to choose which file in amp_seqs to use for homology search.
    • e.g. M. gulosa: insecta/mgulosa/venom-gland
    • e.g. P. toftae: amphibia/ptoftae/skin-liver
  6. Move all reads and reference FASTA files to the respective working directories for each dataset. See below for an example.
  7. Create a 2 or 3-column space-delimited text file as specified by the Input section below, called input.txt, in the working directory of each dataset.

At the end of setup, you should have a directory structure similar to below (excludes other directories, like scripts/):

rAMPage
├── amphibia
│   └── ptoftae
│       └── skin-liver
│           ├── input.txt
│           └── raw_reads
│               ├── SRR8288040_1.fastq.gz
│               ├── SRR8288040_2.fastq.gz
│               ├── SRR8288041_1.fastq.gz
│               ├── SRR8288041_2.fastq.gz
│               ├── SRR8288056_1.fastq.gz
│               ├── SRR8288056_2.fastq.gz
│               ├── SRR8288057_1.fastq.gz
│               ├── SRR8288057_2.fastq.gz
│               ├── SRR8288058_1.fastq.gz
│               ├── SRR8288058_2.fastq.gz
│               ├── SRR8288059_1.fastq.gz
│               ├── SRR8288059_2.fastq.gz
│               ├── SRR8288060_1.fastq.gz
│               ├── SRR8288060_2.fastq.gz
│               ├── SRR8288061_1.fastq.gz
│               └── SRR8288061_2.fastq.gz
└── insecta
    └── mgulosa
        └── venom
            ├── input.txt
            ├── raw_reads
            │   ├── SRR6466797_1.fastq.gz
            │   └── SRR6466797_2.fastq.gz
            └── tsa.GGFG.1.fsa_nt.gz

Dependencies

Basics

Dependency Tested Version
GNU bash v5.0.11(1)
GNU awk v5.0.1
GNU sed v4.8
GNU grep v3.4
GNU make v4.3
GNU column 2.36
Miller mlr 5.4.0
bc v1.07.1
gzip v1.10
python v3.7.7
Rscript* v4.0.2

*requires tidyverse v1.3.0, glue v1.4.2, and docopt v0.7.1.

Tools

Dependency Tested Version
SRA toolkit v2.10.5
EDirect v13.8
fastp v0.20.0
RNA-Bloom v1.3.1
salmon v1.3.0
TransDecoder v5.5.0
HMMER v3.3.1
cd-hit v4.8.1
seqtk v1.1-r91
SignalP v3.0
ProP v1.0c
AMPlify v1.1.0
ENTAP v0.10.7-beta
Exonerate v2.4.0
SABLE v4.0
Clustal Omega v1.2.4

Configurations

Configuring SignalP

To download SignalP, you must enter your email address and institution. Afterwards, a download link valid for 4 hours will be emailed to you. Clicking on the link will show you one link for each system (e.g. Linux). Click the link to download, or right click to copy the link and download on the command line using curl or wget.

After moving the downloaded signalp-3.0.Linux.tar.Z file to src, decompress it:

cd src/
cat signalp-3.0.Linux.tar.Z | uncompress | tar xvf -

The file to edit is src/signalp-3.0/signalp:

Before After
SIGNALP=/usr/opt/signalp-3.0 SIGNALP=$ROOT_DIR/src/signalp-3.0
AWK=nawk AWK=awk

Note: More changes may need to be made according to what executables are accessible in your PATH variable and on your system. For FULL installation instructions, please read src/signalp-3.0/signalp-3.0.readme in detail.

The experimental scripts/helpers/install_prop.sh can be used to install SignalP with the changes listed above, but more changes may be required. Make sure that SignalP works with the test datasets in its directory before running rAMPage, e.g.

cd src/signalp-3.0
./signalp -t euk test/test.seq
Configuring ProP

To download ProP, you must enter your email address and institution. Afterwards, a download link valid for 4 hours will be emailed to you. Clicking on the link will show one link for each system (e.g. Linux). Click the link to download, or right click to copy the link and download on the command line using curl or wget.

After moving the downloaded prop-1.0c.Linux.tar.Z file to src, decompress it:

cd src/
cat prop-1.0c.Linux.tar.Z | uncompress | tar xvf -

The file to edit is src/prop-1.0c/prop:

Before After
setenv PROPHOME /usr/cbs/packages/prop/1.0c/prop-1.0c setenv PROPHOME $ROOT_DIR/src/prop-1.0c
*setenv SIGNALP /usr/cbs/bio/bin/signalp setenv SIGNALP $ROOT_DIR/src/signalp-3.0/signalp

*edit the one corresponding to your system, Linux used in the example

Note: More changes may need to be made according to what executables are accessible in your PATH variable and on your system. For FULL installation instructions, please read src/prop-1.0c/prop-1.0c.readme in detail.

The experimental scripts/helpers/install_prop.sh can be used to install ProP with the changes listed above, but more changes may be required. Make sure that ProP works with the test datasets in its directory before running rAMPage, e.g.

cd src/prop-1.0c
./prop -s test/EDA_HUMAN.fsa 
Configuring ENTAP

Download and decompress the following databases:

Database Example Download Code
RefSeq: Non-mammalian Vertebrates
(for amphibia)
wget -O vertebrate_other_protein.faa.gz ftp://ftp.ncbi.nlm.nih.gov/refseq/release/vertebrate_other/vertebrate_other.*.protein.faa.gz
RefSeq: Invertebrates
(for insecta)
wget -O invertebrate_protein.faa.gz ftp://ftp.ncbi.nlm.nih.gov/refseq/release/invertebrate/invertebrate.*.protein.faa.gz
SwissProt wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
NCBI nr wget -O nr.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

After decompression, the databases can be configured using scripts/config-entap.sh:

scripts/config-entap.sh -t 8 invertebrate_protein.faa vertebrate_other_protein.faa uniprot_sprot.fasta nr.fasta

The script configures all the databases in the EnTAP-0.10.7-beta/bin directory.

Configuring SABLE

The file to edit is src/sable_v4_distr/run.sable:

Before After
remDir=$PWD; remDir=$PWD; THREADS=$1;
export SABLE_DIR="/users/radamcza/work/newSable/sable_distr"; export SABLE_DIR="$ROOT_DIR/src/sable_v4_distr";
export BLAST_DIR="/usr/local/blast/2.2.28/bin"; export BLAST_DIR=$BLAST_DIR
*export NR_DIR="/database/ncbi/nr" export NR_DIR=$ROOT_DIR/src/EnTAP-0.10.7-beta/bin/nr
export PRIMARY_DATABASE="/users/radamcza/work/newSable/sable_distr/GI_indexes/pfam_index" export PRIMARY_DATABASE="$ROOT_DIR/src/sable_v4_distr/GI_indexes/pfam_index"
export SECONDARY_DATABASE="/users/radamcza/work/newSable/sable_distr/GI_indexes/swissprot_index" export SECONDARY_DATABASE="$ROOT_DIR/src/sable_v4_distr/GI_indexes/swissprot_index";
mkdir $PBS_JOBID mkdir -p $PBS_JOBID
/usr/bin/perl ${SABLE_DIR}/sable.pl perl ${SABLE_DIR}/sable.pl $THREADS

*After downloading the nr FASTA file (see below), it will need to be configured using BLAST+'s makeblastdb.

Optional

Dependency Tested Version
GNU wget v1.20.3
curl v7.72.0
pigz v2.4

Input

A 2 or 3-column space-delimited text file named input.txt, located in the working directory of each dataset.

Column Attribute
1 Pooling ID: generally a condition, tissue, or sex, etc.
2 Path to read 1
3 Path to read 2 (if paired-end reads)

Read paths in this input text file should be relative to the location of the input text file.

Need help downloading reads? The scripts/helpers/get-reads.sh script can be used to download reads. These dependencies are required:

Dependency Tested Version
SRA toolkit
v2.10.5
EDirect v13.8

The input runs.txt should have one SRR accession on each line.

Example: M. gulosa

POOLING ID READ 1 READ 2
venom raw_reads/SRR6466797_1.fastq.gz raw_reads/SRR6466797_2.fastq.gz

insecta/mgulosa/venom/input.txt:

venom raw_reads/SRR6466797_1.fastq.gz raw_reads/SRR6466797_2.fastq.gz

Using scripts/helpers/get-reads.sh:

scripts/helpers/get-reads.sh -o insecta/mgulosa/venom/raw_reads -p insecta/mgulosa/venom/runs.txt

insecta/mgulosa/venom/runs.txt:

SRR6466797

Example: P. toftae

POOLING ID READ 1 READ 2
liver raw_reads/SRR8288040_1.fastq.gz raw_reads/SRR8288040_2.fastq.gz
skin raw_reads/SRR8288041_1.fastq.gz raw_reads/SRR8288041_2.fastq.gz
liver raw_reads/SRR8288056_1.fastq.gz raw_reads/SRR8288056_2.fastq.gz
skin raw_reads/SRR8288057_1.fastq.gz raw_reads/SRR8288057_2.fastq.gz
liver raw_reads/SRR8288058_1.fastq.gz raw_reads/SRR8288058_2.fastq.gz
skin raw_reads/SRR8288059_1.fastq.gz raw_reads/SRR8288059_2.fastq.gz
liver raw_reads/SRR8288060_1.fastq.gz raw_reads/SRR8288060_2.fastq.gz
skin raw_reads/SRR8288061_1.fastq.gz raw_reads/SRR8288061_2.fastq.gz

amphibia/ptoftae/skin-liver/input.txt:

liver raw_reads/SRR8288040_1.fastq.gz raw_reads/SRR8288040_2.fastq.gz
skin raw_reads/SRR8288041_1.fastq.gz raw_reads/SRR8288041_2.fastq.gz
liver raw_reads/SRR8288056_1.fastq.gz raw_reads/SRR8288056_2.fastq.gz
skin raw_reads/SRR8288057_1.fastq.gz raw_reads/SRR8288057_2.fastq.gz
liver raw_reads/SRR8288058_1.fastq.gz raw_reads/SRR8288058_2.fastq.gz
skin raw_reads/SRR8288059_1.fastq.gz raw_reads/SRR8288059_2.fastq.gz
liver raw_reads/SRR8288060_1.fastq.gz raw_reads/SRR8288060_2.fastq.gz
skin raw_reads/SRR8288061_1.fastq.gz raw_reads/SRR8288061_2.fastq.gz

Using scripts/helpers/get-reads.sh:

scripts/helpers/get-reads.sh -o amphibia/ptoftae/skin-liver/raw_reads -p amphibia/ptoftae/skin-liver/runs.txt

amphibia/ptoftae/skin-liver/runs.txt:

SRR8288040
SRR8288041
SRR8288056
SRR8288057
SRR8288058
SRR8288059
SRR8288060
SRR8288061

Reference Transcriptomes

To use a reference transcriptome for the assembly stage with RNA-Bloom, put the reference in the working directory or use the -r option of scripts/rAMPage.sh.

insecta/mgulosa/venom
├── input.txt
├── raw_reads
│   ├── SRR6466797_1.fastq.gz
│   └── SRR6466797_2.fastq.gz
└── tsa.GGFG.1.fsa_nt.gzz

In this case, the reference transcriptome is a Transcriptome Shotgun Assembly for M. gulosa, downloaded from ftp://ftp.ncbi.nlm.nih.gov/genbank/tsa/G/tsa.GGFG.1.fsa_nt.gz.

Multiple references can be used as long as they are placed in the working directory.

Sources of References

Representative Genomes can be found by searching the Genome database on NCBI, using these search terms (A. mellifera, for example):

"Apis mellifera"[orgn]

Transcriptome Shotgun Assemblies can be found by searching the Nucleotide database on NCBI, using these search terms:

tsa-master[prop] "Apis mellifera"[orgn] midgut[All Fields]

Usage

The rAMPage.sh script in scripts/ runs the pipeline using a Makefile.

PROGRAM: rAMPage.sh

DESCRIPTION:
      Runs the rAMPage pipeline, using the Makefile.
      
USAGE(S):
      rAMPage.sh [-a <address>] [-b] [-c <taxonomic class>] [-d] [-f] [-h] [-m] [-n <species name>] [-o <output directory>] [-p] [-r <FASTA.gz>] [-s] [-t <int>] [-v] <input reads TXT file>
      
OPTIONS:
       -a <address>    email address for alerts                               
       -c <class>      taxonomic class of the dataset                         (default = top-level directory in $outdir)
       -d              debug mode of Makefile                                 
       -f              force characterization even if no AMPs found           
       -h              show help menu                                         
       -m <target>     Makefile target                                        (default = exonerate)
       -n <species>    taxonomic species or name of the dataset               (default = second-level directory in $outdir)
       -o <directory>  output directory                                       (default = directory of input reads TXT file)
       -p              run processes in parallel                              
       -r <FASTA.gz>   reference transcriptome                                (accepted multiple times, *.fna.gz *.fsa_nt.gz)
       -s              strand-specific library construction                   (default = false)
       -t <int>        number of threads                                      (default = 48)
       -v              print version number                                   
       -E <e-value>    E-value threshold for homology search                  (default = 1e-5)
       -S <3.0103 to 80>     AMPlify score threshold for amphibian AMPs             (default = 10)
       -L <int>        Length threshold for AMPs                              (default = 30)
       -C <int>        Charge threshold for AMPs                              (default = 2)
       -R              Disable redundancy removal during transcript assembly  
                                                                              
EXAMPLE(S):
      rAMPage.sh -a user@example.com -c class -n species -p -s -t 8 -o /path/to/output/directory -r /path/to/reference.fna.gz -r /path/to/reference.fsa_nt.gz /path/to/input.txt 
      
INPUT EXAMPLE:
       tissue /path/to/readA_1.fastq.gz /path/to/readA_2.fastq.gz
       tissue /path/to/readB_1.fastq.gz /path/to/readB_2.fastq.gz
       
MAKEFILE TARGETS:
       01) check        08) homology
       02) reads        09) cleavage
       03) trim         10) amplify
       04) readslist    11) annotation
       05) assembly     12) exonerate
       06) filtering    13) sable
       07) translation  14) all

DESCRIPTION:
      Runs the rAMPage pipeline, using the Makefile.
      
USAGE(S):
      rAMPage.sh [-a <address>] [-b] [-c <taxonomic class>] [-d] [-f] [-h] [-m] [-n <species name>] [-o <output directory>] [-p] [-r <FASTA.gz>] [-s] [-t <int>] [-v] <input reads TXT file>
      
OPTIONS:
       -a <address>    email address for alerts                               
       -c <class>      taxonomic class of the dataset                         (default = top-level directory in $outdir)
       -d              debug mode of Makefile                                 
       -f              force characterization even if no AMPs found           
       -h              show help menu                                         
       -m <target>     Makefile target                                        (default = exonerate)
       -n <species>    taxonomic species or name of the dataset               (default = second-level directory in $outdir)
       -o <directory>  output directory                                       (default = directory of input reads TXT file)
       -p              run processes in parallel                              
       -r <FASTA.gz>   reference transcriptome                                (accepted multiple times, *.fna.gz *.fsa_nt.gz)
       -s              strand-specific library construction                   (default = false)
       -t <int>        number of threads                                      (default = 48)
       -v              print version number                                   
       -E <e-value>    E-value threshold for homology search                  (default = 1e-5)
       -S <3.0103 to 80>     AMPlify score threshold for amphibian AMPs             (default = 10)
       -L <int>        Length threshold for AMPs                              (default = 30)
       -C <int>        Charge threshold for AMPs                              (default = 2)
       -R              Disable redundancy removal during transcript assembly  
                                                                              
EXAMPLE(S):
      rAMPage.sh -a user@example.com -c class -n species -p -s -t 8 -o /path/to/output/directory -r /path/to/reference.fna.gz -r /path/to/reference.fsa_nt.gz /path/to/input.txt 
      
INPUT EXAMPLE:
       tissue /path/to/readA_1.fastq.gz /path/to/readA_2.fastq.gz
       tissue /path/to/readB_1.fastq.gz /path/to/readB_2.fastq.gz
       
MAKEFILE TARGETS:
       01) check        08) homology
       02) reads        09) cleavage
       03) trim         10) amplify
       04) readslist    11) annotation
       05) assembly     12) exonerate
       06) filtering    13) sable
       07) translation  14) all

Choosing Thresholds

The best way to choose score, length, and score thresholds is to plot the distribution of the reference AMPs.

scripts/helpers/plot-dist.sh -a amphibianAMPs.faa -i insectAMPs.faa -t 8 -o /path/to/output/dir -r

Running from the root of the repository

Example: M. gulosa (stranded library construction)

scripts/rAMPage.sh -v -s -o insecta/mgulosa/venom -r insecta/mgulosa/venom/tsa.GGFG.1.fsa_nt.gz -c insecta -n mgulosa insecta/mgulosa/venom/input.txt

In the example above, the -o insecta/mgulosa/venom argument is optional, since the default will be set as parent directory of the input.txt file. This option is a safeguard for the scenario where input.txt is not located in the working directory. In this case, the -o option will move input.txt and provided references to the working directory.

rAMPage will use all *.fsa_nt* and *.fna* files located in the working directory as references in the assembly stage, regardless of if the -r option is used or not. This option is a safeguard for the scenario where the references provided are not located in the working directory. In this case, the -r option will move the references to the working directory.

Running from the working directory of the dataset

Example: M. gulosa (stranded library construction)

$ROOT_DIR/scripts/rAMPage.sh -s -r tsa.GGFG.1.fsa_nt.gz -c insecta -n mgulosa input.txt

Running multiple datasets from the root of the repository

To run rAMPage on multiple datasets, you can use the stAMPede.sh wrapper script. By default, stAMPede.sh will run rAMPage on the datasets consecutively. If the -s option is invoked, they will be run simultaenously in parallel. The -p option allows parallelization of certain processes, such as trimming reads in parallel.

Note: This script is experimental and has fewer options than running rAMPage.sh.

PROGRAM: stAMPede.sh

DESCRIPTION:
      A wrapper around rAMPage.sh to allow running of multiple assemblies.
      
USAGE(S):
      stAMPede.sh [-a <address>] [-d] [-h] [-m] [-p] [-s] [-t <int>] [-v] <accessions TXT file>
      
OPTION(S):
       -a <address>  email address for alerts                                   
       -d            debug mode                                                 
       -h            show help menu                                             
       -m <target>   Makefile target                                            (default = exonerate)
       -p            allow parallel processes for each dataset                  
       -s            simultaenously run rAMPAge on all datasets                 (default if SLURM available)
       -t <int>      number of threads                                          (default = 48)
       -v            verbose (uses /usr/bin/time -pv to time each rAMPage run)  
       -E <e-value>  E-value threshold for homology search                      (default = 1e-5)
       -S <3.0103 to 80>   AMPlify score threshold for amphibian AMPs                 (default = 10)
       -L <int>      Length threshold for AMPs                                  (default = 30)
       -C <int>      Charge threshold for AMPs                                  (default = 2)
                                                                                
ACCESSIONS TXT FORMAT:
       CLASS/SPECIES/TISSUE_OR_CONDITION/input.txt strandedness
       amphibia/ptoftae/skin-liver/input.txt nonstranded
       insecta/mgulosa/venom/input.txt stranded
       
EXAMPLE(S):
      stAMPede.sh -a user@example.com -p -s -v accessions.txt

Input

For running multiple datasets, the multi-input text file should be a 2-column text file:

Column Attribute
1 path to input.txt file
2 stranded or nonstranded

Example: P. toftae and M. gulosa

input.txt strandedness
amphibia/ptoftae/skin-liver/input.txt nonstranded
insecta/mgulosa/venom/input.txt stranded

multi-input.txt:

amphibia/ptoftae/skin-liver/input.txt nonstranded
insecta/mgulosa/venom/input.txt stranded

AMPs for Synthesis

For reproducibility, clustering AMPs across datasets and choosing AMPs for synthesis are included in the scripts/stAMPede.sh script, but manual clustering can be done using scripts/cluster.sh:

scripts/cluster.sh -o /path/to/outdir amphibia/ptoftae/skin-liver/exonerate insecta/mgulosa/venom/exonerate

These are the 90 AMPs selected for synthesis, but only a subset of 21 have been validated in vitro thus far.

Directory Structure

Example directory structure:

rAMPage
├── amphibia
│   └── ptoftae
│       └── skin-liver
├── amp_seqs
├── insecta
│   └── mgulosa
│       └── venom
├── scripts
└── src

Citation

Lin, D.; Sutherland, D.; Aninta, S.I.; Louie, N.; Nip, K.M.; Li, C.; Yanai, A.; Coombe, L.; Warren, R.L.; Helbing, C.C.; et al. Mining Amphibian and Insect Transcriptomes for Antimicrobial Peptide Sequences with rAMPage. Antibiotics 2022, 11, 952. https://doi.org/10.3390/antibiotics11070952