ReGAIN Installation and User Guide



Ensure that you have the following prerequisites installed on your system:

Python (version 3.8 or higher)

R (version 4 or higher)

NCBI AMRfinderPlus

Install R

We suggest installing ReGAIN and all prerequisites within a Conda environment.

Download miniforge

Create Conda environment and install NCBI AMRfinderPlus

conda create -n regain python=3.10

conda activate regain

Install AMRfinderPlus

conda install -y -c conda-forge -c bioconda ncbi-amrfinderplus

Check installation

amrfinder -h

Download AMRfinderPlus Database

amrfinder -u

Download ReGAIN to preferred directory

git clone

Install Python dependencies

pip install -r requirements.txt or pip3 install -r requirements.txt

Add ReGAIN to your PATH

Add this line to the end of .bash_profile (Linux/Unix) or .zshrc (macOS):

export PATH="$PATH:/path/to/regain/bin"

Replace /path/to/regain/bin with the actual path to the directory containing the executable.
Whatever the initial directory, this path should end with /regain/bin

Save the file and restart your terminal or run source ~/.bash_profile or source ~/.zshrc

Verify installation:

regain --version

Use -h or --help to bring up the help menu:

regain --help
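As a quick sanity check after installation, a short script along these lines could confirm that the prerequisites are reachable from the active environment. It is only a sketch: the executable names (Rscript, amrfinder, regain) are assumptions taken from the commands used in this guide, and the helper names are hypothetical rather than part of ReGAIN.

```python
import shutil
import sys

# Assumed executable names, based on the commands shown in this guide.
REQUIRED_TOOLS = ("Rscript", "amrfinder", "regain")


def python_is_supported(version_info=sys.version_info, minimum=(3, 8)):
    """Return True if the given Python version meets the documented minimum (3.8)."""
    return tuple(version_info[:2]) >= minimum


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.8 or higher is required")
    missing = find_missing_tools()
    print("Missing from PATH:", ", ".join(missing) if missing else "none")
```

If regain is reported as missing, the PATH entry added above is the first thing to recheck.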

Programs and Example Usage

Resistance and Virulence Gene Identification

Module 1.1 regain AMR

-d, --directory, path to directory containing FASTA files to analyze
-O, --organism, specify what organism (if any) you want to analyze (optional flag)
-T, --threads, number of cores to dedicate
-o, --output-dir, output directory to store AMRfinder results

Currently supported organisms and how they should be called:


Module 1.1 example usage:

Organism specific:

regain AMR -d path/to/FASTA/files -O Pseudomonas_aeruginosa -T 8 -o path/to/output/directory

Organism non-specific:

regain AMR -d path/to/FASTA/files -T 8 -o path/to/output/directory

Dataset Creation

NOTE: variable names cannot contain special characters.
To replace special characters during dataset creation, include --simplify-gene-names in the command.

Module 1.2 regain matrix

-d, --directory, path to AMRfinder results in CSV format
-s, --search-strings-output, name of output file where gene names will be stored
--gene-type, searches for resistance or virulence genes
-f, --search-output, presence/absence matrix of all genes in your dataset, regardless of --min/--max values
--min, minimum gene occurrence cutoff
--max, maximum gene occurrence cutoff (should be less than number of genomes, see NOTE below)
--simplify-gene-names, replaces special characters in gene names, e.g., aph(3’’)-Ib becomes aph3pp_Ib. This is required for the Bayesian network structure learning module.
-o, --output, output of final curated presence/absence matrix
--verbose, reports actual variable counts, overwriting binary output
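To illustrate the kind of renaming --simplify-gene-names performs, here is a minimal Python sketch. The only mapping confirmed by this guide is aph(3’’)-Ib becoming aph3pp_Ib; the rules below (primes become "p", brackets are dropped, other special characters become "_") are assumptions chosen to reproduce that one example, and simplify_gene_name is a hypothetical helper, not ReGAIN's own implementation.

```python
import re


def simplify_gene_name(name):
    """Rewrite a gene name so it contains only letters, digits and underscores.

    Assumed rules, inferred from the single documented example:
      - straight or typographic primes become "p"
      - parentheses and square brackets are removed
      - any other special character becomes "_"
    """
    name = re.sub(r"['’′]", "p", name)
    name = re.sub(r"[()\[\]]", "", name)
    name = re.sub(r"[^A-Za-z0-9_]", "_", name)
    return name


print(simplify_gene_name("aph(3'')-Ib"))  # prints aph3pp_Ib
```

Other names would be handled by the same assumed rules, so the real output of ReGAIN may differ for characters not covered by the documented example.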

Module 1.2 example usage

NOTE: Discrete Bayesian network analyses require all variables to exist in at least two states. For ReGAIN, these two states are 'present' and 'absent'. Ubiquitously occurring genes will break the analysis. As a best practice, for N genomes, --max should be no greater than N - 1. Keep in mind that removing very low and very high abundance genes can reduce noise in the network.

regain matrix -d path/to/AMRfinder/results -s search_strings --simplify-gene-names --gene-type resistance -f matrix.csv --min 5 --max 500 -o matrix_final.csv

NOTE: all results are saved in the 'ReGAIN_Dataset' folder, which will be generated within the directory defined by -d
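The effect of the --min and --max cutoffs can be pictured with a small Python sketch. It assumes a presence/absence matrix held as a dictionary of gene name to 0/1 calls per genome, and it assumes both cutoffs are inclusive; neither detail is confirmed by this guide, and filter_genes_by_occurrence is a hypothetical helper rather than ReGAIN code.

```python
def filter_genes_by_occurrence(matrix, min_count, max_count):
    """Keep genes whose number of occurrences lies in [min_count, max_count].

    `matrix` maps gene name -> list of 0/1 calls, one per genome.
    Inclusive bounds are an assumption made for this illustration.
    """
    kept = {}
    for gene, calls in matrix.items():
        occurrences = sum(calls)
        if min_count <= occurrences <= max_count:
            kept[gene] = calls
    return kept


# Four genomes (N = 4), so the upper cutoff is set to N - 1 = 3.
example = {
    "geneA": [1, 1, 1, 1],  # present in every genome: only one state
    "geneB": [1, 0, 1, 0],
    "geneC": [0, 0, 0, 1],  # present in a single genome
}
n_genomes = 4
filtered = filter_genes_by_occurrence(example, min_count=2, max_count=n_genomes - 1)
print(sorted(filtered))  # prints ['geneB']
```

Here geneA is removed because it never takes the 'absent' state, and geneC is removed as a low-abundance gene.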

Bayesian Network Structure Learning

Module 2 regain bnL or regain bnS

-i, --input, input file in CSV format
-M, --metadata, file containing gene names and descriptions
-o, --output_boot, output bootstrap file
-T, --threads, number of cores to dedicate
-n, --number_of_boostraps, how many bootstraps to run (suggested 300-500)
-r, --number-of-resamples, how many data resamples you want to use (suggested 100)

Module 2 example usage:

NOTE: We suggest using between 300 and 500 bootstraps and minimum 100 resamples

bnS, Bayesian network structure learning analysis for fewer than 100 genes
bnL, Bayesian network structure learning analysis for 100 or more genes

For fewer than 100 genes:

regain bnS -i matrix.csv -M metadata.csv -o bootstrapped_network -T 8 -n 500 -r 100

For 100 or more genes:

regain bnL -i matrix.csv -M metadata.csv -o bootstrapped_network -T 8 -n 500 -r 100
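The choice between bnS and bnL depends only on the number of genes, so it can be scripted. The sketch below builds the command line from the flags listed above; choose_bn_module and build_bn_command are hypothetical helper names, and the default values simply mirror the example commands in this section.

```python
def choose_bn_module(n_genes):
    """Return 'bnS' for fewer than 100 genes, otherwise 'bnL'."""
    return "bnS" if n_genes < 100 else "bnL"


def build_bn_command(matrix_csv, metadata_csv, n_genes,
                     output="bootstrapped_network",
                     threads=8, bootstraps=500, resamples=100):
    """Assemble the regain command as an argument list (nothing is executed)."""
    return [
        "regain", choose_bn_module(n_genes),
        "-i", matrix_csv,
        "-M", metadata_csv,
        "-o", output,
        "-T", str(threads),
        "-n", str(bootstraps),
        "-r", str(resamples),
    ]


print(" ".join(build_bn_command("matrix.csv", "metadata.csv", n_genes=42)))
# prints regain bnS -i matrix.csv -M metadata.csv -o bootstrapped_network -T 8 -n 500 -r 100
```

The returned list can be passed to subprocess.run if you want to launch the analysis from Python.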

Multidimensional Analyses

Optional Module 3 regain MVA

Currently supported measures of distance:

manhattan, euclidean, canberra, clark, bray, kulczynski, jaccard, gower,
horn, mountford, raup, binomial, chao, cao, mahalanobis, altGower, morisita,
chisq, chord, hellinger

-i, --input, input file in CSV format
-m, --method, measure of distance method
-c, --centers, how many centers you want for your multidimensional analysis (1-10)
-C, --confidence, confidence interval for ellipses

Module 3 example usage:

regain MVA -i matrix.csv -m jaccard -c 3 -C 0.75

NOTE: the MVA analysis will generate 2 files: a PNG and a PDF of the plot
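For readers unfamiliar with the distance measures, the jaccard option used in the example can be illustrated on presence/absence data. This is a standalone sketch of the textbook Jaccard distance between two binary profiles, not the code ReGAIN runs, and it makes no claim about which rows or columns of the matrix the MVA module compares.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 profiles.

    Defined as 1 - (positions present in both) / (positions present in either).
    Two all-zero profiles are treated as identical (distance 0.0).
    """
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 0.0
    return 1.0 - shared / either


# One shared position out of three occupied positions: distance is 2/3.
d = jaccard_distance([1, 1, 0, 0], [1, 0, 1, 0])
```

A distance of 0 means the two profiles are identical, and 1 means they share nothing.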

Formatting External Data

Bayesian network analysis requires both a data matrix file and a metadata file. MVA analysis requires only a data matrix file.
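When preparing external data, it can help to confirm that every gene in the data matrix has a matching metadata entry before starting a long run. The sketch below is illustrative only and relies on assumed layouts that this guide does not specify: the matrix is taken to have a header row with a genome identifier in the first column and one column per gene, and the metadata is taken to have a header row with the gene name in the first column. genes_missing_metadata is a hypothetical helper.

```python
import csv
import io


def genes_missing_metadata(matrix_csv_text, metadata_csv_text):
    """Return gene columns of the matrix that have no row in the metadata.

    Assumed layouts (not confirmed by the guide):
      matrix:   header row; first column = genome ID, remaining columns = genes
      metadata: header row; first column = gene name
    """
    matrix_header = next(csv.reader(io.StringIO(matrix_csv_text)))
    gene_columns = matrix_header[1:]
    metadata_rows = list(csv.reader(io.StringIO(metadata_csv_text)))
    described = {row[0] for row in metadata_rows[1:] if row}
    return [gene for gene in gene_columns if gene not in described]


# Hypothetical example data, invented for this illustration.
matrix_text = "genome,geneA,geneB,geneC\ng1,1,0,1\ng2,0,1,1\n"
metadata_text = "gene,description\ngeneA,example class 1\ngeneC,example class 2\n"
print(genes_missing_metadata(matrix_text, metadata_text))  # prints ['geneB']
```

Adjust the column positions to match your own files if their layout differs from these assumptions.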


Citing ReGAIN

Resistance Gene Association and Inference Network (ReGAIN): A Bioinformatics Pipeline for Assessing Probabilistic Co-Occurrence Between Resistance Genes in Bacterial Pathogens. Elijah R. Bring Horvath, Mathew G. Stein, Matthew A. Mulvey, Edgar Javier Hernandez, Jaclyn M. Winter. bioRxiv 2024.02.26.582197; doi:


