Prerequisites
Ensure that you have the following prerequisites installed on your system:
Python (version 3.8 or higher)
R (version 4 or higher)
NCBI AMRfinderPlus
We suggest that ReGAIN and all prerequisites are installed within a Conda environment
Download miniforge
Create Conda environment and install NCBI AMRfinderPlus
conda create -n regain python=3.10
conda activate regain
Install AMRfinderPlus
conda install -y -c conda-forge -c bioconda ncbi-amrfinderplus
Check installation
amrfinder -h
Download ARMfinderPlus Database
amrfinder -u
Download ReGAIN to preferred directory
git clone https://github.com/ERBringHorvath/regain_cl
Install Python dependencies
pip install -r requirements.txt
or pip3 install -r requirements.txt
Add ReGAIN to your PATH
Add this line to the end of .bash_profile
(Linux/Unix) or .zshrc
(macOS):
export PATH="$PATH:/path/to/regain/bin"
Replace /path/to/regain/bin
with the actual path to the directory containing the executable.
Whatever the initial directory, this path should end with /regain/bin
Save the file and restart your terminal or run source ~/.bash_profile
or source ~/.zshrc
Verify installation:
regain --version
use -h
, --help
, to bring up the help menu
regain --help
Module 1.1 regain AMR
-d
, --directory
, path to directory containing FASTA files to analyze
-O
, --organism
, specify what organism (if any) you want to analyze (optional flag)
-T
, --threads
, number of cores to dedicate
-o
, --output-dir
, output directory to store AMRfinder results
Currently supported organisms and how they should be called:
Acinetobacter_baumannii
Burkholderia_cepacia
Burkholderia_pseudomallei
Campylobacter
Clostridioides_difficile
Enterococcus_faecalis
Enterococcus_faecium
Escherichia
Klebsiella
Neisseria
Pseudomonas_aeruginosa
Salmonella
Staphylococcus_aureus
Staphylococcus_pseudintermedius
Streptococcus_agalactiae
Streptococcus_pneumoniae
Streptococcus_pyogenes
Vibrio_cholerae
Module 1.1 example usage:
Organism specific:
regain AMR -d path/to/FASTA/files -O Pseudomonas_aruginosa -T 8 -o path/to/output/directory
Organism non-specific:
regain AMR -d path/to/FASTA/files -T 8 -o path/to/output/directory
NOTE: variable names cannot contain special characters–but don't worry, we've taken care of that!
To replace special characters during dataset creation, include --simplify-gene-names
in the command!
Module 1.2 regain matrix
-d
, --directory
, path to AMRfinder results in CSV format
-s
, --search-strings-output
, name of output file where gene names will be stored
--gene-type
, searches for resistance
or virulence
genes
-f
, --search-output
, presence/absence matrix of all genes in your dataset, regardless of --min
/--max
values
--min
, minimum gene occurrence cutoff
--max
, maximum gene occurrence cutoff (should be less than number of genomes, see NOTE below)
--simplify-gene-names
, replaces special characters in gene names, i.e., aph(3’’)-Ib becomes aph3pp_Ib. This is required for the Bayesian network structure learning module
-o
, --output
, output of final curated presence/absence matrix
--verbose
, reports actual variable counts, overwriting binary output
Module 1.2 example usage
NOTE: Discrete Bayesian network anlyses requires all variables to exist in at least two states. For ReGAIN, these two states are 'present' and 'absent'. Ubiquitously occurring genes will break the analysis.
Best practice is for N genomes, --max
should MINIMALLY be defined as N - 1. Keep in mind that removing very low and very high abundance genes can reduce noise in the network.
regain matrix -d path/to/AMRfinder/results -s search_strings --simplify-gene-names --gene-type
resistance -f matrix.csv --min 5 --max 500 -o matrix_final.csv
NOTE: all results are saved in the 'ReGAIN_Dataset' folder, which will be generated within the directory defined by -d
Module 2 regain bnL
or regain bnS
-i
, --input
, input file in CSV format
-M
, --metadata
, file containing gene names and descriptions
-o
, --output_boot
, output bootstrap file
-T
, --threads
, number of cores to dedicate
-n
, --number_of_boostraps
, how many bootstraps to run (suggested 300-500)
-r
, --number-of-resamples
, how many data resamples you want to use (suggested 100)
Module 2 example usage:
NOTE: We suggest using between 300 and 500 bootstraps and minimum 100 resamples
bnS
, Bayesian network structure learning analysis for less than 100 genes
bnL
, Bayesian network structure learning analysis for 100 genes or greater
For less than 100 genes:
regain bnS regain bnS -i matrix.csv -M metadata.csv -o bootstrapped_network -T 8 -n 500 -r 100
For 100 or more genes:
regain bnL -i matrix.csv -M metadata.csv -o bootstrapped_network -T 8 -n 500 -r 100
Optional Module 3 regain MVA
Currently supported measures of distance:
manhattan
, euclidean
, canberra
, clark
, bray
, kulczynski
, jaccard
, gower
,
horn
, mountford
, raup
, binomial
, chao
, cao
, mahalanobis``altGower
, morisita
,
chisq
, chord
, hellinger
-i
, --input
, input file in CSV format
-m
, --method
, measure of distance method
-c
, --centers
, how many centers you want for your multidimensional analysis (1-10)
-C
, --confidence
, confidence interval for ellipses
Module 3 example usage:
regain MVA -i matrix.csv -m jaccard -c 3 -C 0.75
NOTE: the MVA analysis will generate 2 files: a PNG and a PDF of the plot
Bayesian network analysis requires both data matrix and metadata files. MVA analysis requires only a data matrix file.
Resistance Gene Association and Inference Network (ReGAIN): A Bioinformatics Pipeline for Assessing Probabilistic Co-Occurrence Between Resistance Genes in Bacterial Pathogens Elijah R. Bring Horvath, Mathew G. Stein, Matthew A Mulvey, Edgar Javier Hernandez, Jaclyn M. Winter bioRxiv 2024.02.26.582197; doi: https://doi.org/10.1101/2024.02.26.582197