GitHub - HaidYi/acrfinder: AcrFinder, a tool for automated identification of Acr-Aca loci

AcrFinder

(c) Yin Lab@UNL2019

I. Installation / Dependencies

Dependencies

Clone/download the repository. Some dependencies are included and can be found in the dependencies/ directory. The program expects these versions; using other versions can result in unexpected behavior.

CRISPRCasFinder - Already in dependencies/ directory. To use CRISPRCasFinder on your machine, make sure you run its install script. The manual can be found here. Running the install script will set up paths for all the dependencies of CRISPRCasFinder.

It is a common problem to forget to install CRISPRCasFinder, so ensure that CRISPRCasFinder runs properly before executing acr_aca_cri_runner.py to avoid errors.

blastn - acr_aca_cri_runner.py will call/use blastn to search a genome. Install blastn from NCBI.

psiblast+ - Used with CDD to find mobilome proteins. Install from NCBI.

blastp - Used with prophage database to find prophage. Install blastp from NCBI

python3 - For all scripts with .py extension. Use any version at or above 3.4.

PyGornism - Already in dependencies/ directory. Used to parse organism files and generate organism files in certain formats.
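
Before running the pipeline, it can help to confirm that the external programs listed above are reachable on PATH. The following is a minimal sketch and not part of AcrFinder; the helper name `missing_tools` and the default tool list are assumptions for illustration, and the executable names on your system may differ.

```python
import shutil

# Executable names assumed from the dependency list above; adjust to your install.
DEFAULT_TOOLS = ["blastn", "blastp", "python3"]


def missing_tools(tools=DEFAULT_TOOLS):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

This only checks that an executable exists; it does not verify versions or that CRISPRCasFinder's own install script has been run.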

Database Preparation

After cloning the repository, three databases need to be installed.

Prophage

cd dependencies/prophage && makeblastdb -in prophage_virus.db -dbtype prot -out prophage

CDD-MGE

cd dependencies/ && tar -xzf cdd-mge.tar.gz && rm cdd-mge.tar.gz

CDD

mkdir -p dependencies/cdd
cd dependencies/cdd && wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cdd.tar.gz && tar -xzf cdd.tar.gz && rm cdd.tar.gz
makeprofiledb -title CDD.v.3.12 -in Cdd.pn -out Cdd -threshold 9.82 -scale 100.0 -dbtype rps -index true

II. About

AcrFinder is a tool used to identify Anti-CRISPR proteins (Acr) using both sequence homology and guilt-by-association approaches.

This README file contains information about only the python scripts found in the current directory. These are the scripts that are used to identify genomic loci that contain Acr and/or Aca homologs.

To find out how to use other dependencies look at online sources:

CRISPRCasFinder - https://crisprcas.i2bc.paris-saclay.fr/CrisprCasFinder/Index

*CRISPRCasFinder is used to identify CRISPR-Cas systems. These are then used to classify the genomic loci that contain Acr and/or Aca homologs. If no CRISPR-Cas systems are found within a genome, only the homology-based search for Acr homologs is run.

III. Using AcrFinder

Input

AcrFinder takes .fna, .gff and .faa files as input. Providing only the .fna file is also acceptable; in that case, the .gff and .faa files are generated by running Prodigal.

List of Options

Option	Alternative	Purpose
-h	--help	Shows all available options
-n	--inFNA	Required fna file
-f	--inGFF	Required Path to gff file to use/parse
-a	--inFAA	Required Path to faa file to use/parse
-m	--aaThresh	Max size of a protein in order to be considered Aca/Acr (aa) {default = 200} [integer]
-d	--distThresh	Max intergenic distance between proteins (bp) {default = 150} [integer]
-r	--minProteins	Min number of proteins needed per locus {default = 2} [integer]
-y	--arrayEvidence	Minimum evidence level needed of a CRISPR spacer to use {default = 3} [integer]
-o	--outDir	Path to output directory to store results in. If not provided, the program will attempt to create a new one with given path
-t	--aca	Known Aca file (.faa) used with diamond to search for candidate Aca proteins in candidate Acr-Aca loci
-u	--acr	Known Acr file (.faa) used with diamond to search for Acr homologs
-z	--genomeType	How to treat the genome. There are three options: Virus, Bacteria and Archaea. Virus will not run `CRISPRCasFinder` (note: when Virus is selected, also set `-c 0` so that no MGE search is performed), Archaea will run `CRISPRCasFinder` with a special Archaea flag (-ArchaCas), and Bacteria will run `CRISPRCasFinder` without the Archaea flag {default = V} [string]
-e	--proteinUpDown	Number of surrounding (up- and down-stream) proteins to use when gathering a neighborhood {default = 10} [integer]
-c	--minCDDProteins	Minimum number of proteins in neighborhood that must have a CDD mobilome hit so the Acr/Aca locus can be attributed to a CDD hit {default = 1} [integer]
-g	--gi	Uses IslandViewer (GI) database. {default = false} [boolean]
-p	--prophage	Uses PHASTER (prophage) database. {default = false} [boolean]
-s	--strict	All proteins in locus must lie within a region found in DB(s) being used {default = false} [boolean]
-l	--lax	Only one protein must lie within a region found in DB(s) being used {default = true} [boolean]
--blsType	None	Which blast type to use when searching for mobile genetic elements (MGE). {default = blastp} Possible choices: blastp or rpsblast
--identity	None	The --id (identity) parameter for diamond to search {default=30} [integer]
--coverage	None	The --query-cover parameter for diamond to search {default=0.8} [float]
--e_value	None	The -e (e-value) parameter for diamond to search {default=0.01} [float]
--blast_slack	None	How far (in bp) an Acr/Aca locus is allowed to be from a blastn hit to be considered high confidence {default=5000}
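
To make the -m, -d and -r thresholds concrete, here is a toy sketch of how such filters could be applied to an ordered list of proteins. It is not AcrFinder's implementation: the `Protein` tuple, the function `candidate_loci`, and the simplifications (a single contig, strand ignored, a long protein ends the current run) are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Protein(NamedTuple):
    name: str
    start: int      # start coordinate on the contig (bp)
    end: int        # inclusive end coordinate (bp)
    length_aa: int  # protein length in amino acids


def candidate_loci(proteins: List[Protein], aa_thresh: int = 200,
                   dist_thresh: int = 150, min_proteins: int = 2) -> List[List[Protein]]:
    """Group consecutive short proteins separated by small intergenic gaps."""
    loci: List[List[Protein]] = []
    current: List[Protein] = []
    for prot in sorted(proteins, key=lambda p: p.start):
        short_enough = prot.length_aa <= aa_thresh
        # Intergenic distance = number of bases between the two genes.
        close_enough = bool(current) and prot.start - current[-1].end - 1 <= dist_thresh
        if short_enough and (not current or close_enough):
            current.append(prot)
            continue
        # The run is broken: keep it if it has enough proteins, then start over.
        if len(current) >= min_proteins:
            loci.append(current)
        current = [prot] if short_enough else []
    if len(current) >= min_proteins:
        loci.append(current)
    return loci
```

With the default values, two 100 aa proteins separated by 99 bp would form one candidate locus, while an isolated short protein would be discarded because it does not reach the minimum of two proteins.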

Output

Classification

There are three levels of classification in output:

Classification	Meaning
Low Confidence	If this Acr-Aca locus has a CRISPR-Cas locus but no self-targeting spacers in the genome, it is labeled as “low confidence” and inferred to target the CRISPR-Cas locus.
Medium Confidence	If this Acr-Aca locus has a self-targeting spacer target in the genome but not nearby, it is labeled as “medium confidence” and inferred to target the CRISPR-Cas locus with the self-targeting spacer. "Nearby" means within 5,000 bp.
High Confidence	If this Acr-Aca locus has a nearby self-targeting spacer target, it is labeled as “high confidence” and inferred to target the CRISPR-Cas locus with the self-targeting spacer.
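
The three levels in the table can be read as a small decision rule. The sketch below is only an illustration of that rule, not AcrFinder's code: the function name `classify_locus`, the representation of each self-targeting spacer hit as a single genome position, and the way distance to the locus is measured are all assumptions. The `slack` default mirrors the 5,000 bp value given above.

```python
from typing import Iterable, Optional


def classify_locus(locus_start: int, locus_end: int, has_crispr_cas: bool,
                   spacer_hit_positions: Iterable[int],
                   slack: int = 5000) -> Optional[str]:
    """Return a confidence label for one Acr-Aca locus, or None without CRISPR-Cas."""
    if not has_crispr_cas:
        # Without a CRISPR-Cas system only the homology-based search applies.
        return None
    hits = list(spacer_hit_positions)
    if not hits:
        return "Low Confidence"
    # Assumption: a hit is "nearby" if it lies within `slack` bp of the locus.
    nearby = any(locus_start - slack <= pos <= locus_end + slack for pos in hits)
    return "High Confidence" if nearby else "Medium Confidence"
```

For example, a locus spanning positions 10,000 to 11,000 with a single self-targeting hit at position 14,000 would be labeled high confidence under these assumptions, because the hit lies within 5,000 bp of the locus end.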

Output files

Name	Meaning
<output_dir>/CRISPRCas_OUTPUT	The output folder of CRISPRCasFinder
<output_dir>/subjects	The folder that contains the input files
<output_dir>/intermediates	The folder that contains intermediate result files
<output_dir>/intermediates/blast_out.txt	Results from blast+
<output_dir>/<organism_id>_guilt-by-association.out	The final set of Acr/Aca regions that passed the initial filters as well as the CDD mobilome and prophage/gi filters.
<output_dir>/<organism_id>_homology_based.out	The final set of proteins that have similarity to proteins in the Acr database under given similarity threshold.
<output_dir>/intermediates/masked_db/	The directory contains the db (fna with crispr array regions masked) to be used for blastn search for self-targeting spacer matches (the database for blastn search)
<output_dir>/intermediates/spacers_with_desired_evidence.fna	The file contains CRISPR spacers extracted from crisprcasfinder results that have the desired evidence level. The query for blastn search
<output_dir>/intermediates/<organism_id>_candidate_acr_aca.txt	Potential Acr/Aca regions that passed initial filters.
<output_dir>/intermediates/<organism_id>_candidate_acr_aca.faa	Potential Acr/Aca regions in an faa format.
<output_dir>/intermediates/<organism_id>_candidate_acr_aca_neighborhood.faa	An extension of the previous file that also includes the neighboring proteins of the potential Acr/Aca. Used as the query for blastp search against prophage.
<output_dir>/intermediates/<organism_id>_candidate_acr_aca_{blastp/rpsblast}_results.txt	Result file from blastp against prophage database or rpsblast against cdd-mge database.
<output_dir>/intermediates/<organism_id>_candidate_acr_aca_diamond_result.txt	Results of diamond. These are search results with the Aca database as the query and <output_dir>/intermediates/<organism_id>_candidate_acr_aca.faa as the database.
<output_dir>/intermediates/<organism_id>_candidate_acr_homolog_result.txt	Results of diamond. These are search results with the Acr database as the query and <output_dir>/intermediates/<organism_id>_candidate_acr_aca.faa as the database.
<output_dir>/intermediates/<organism_id>_candidate_acr_aca_diamond_database.dmnd	Database of diamond made from <organism_id>_candidate_acr_aca.faa file.
<output_dir>/intermediates/<organism_id>_acr_homolog_result.txt	Results of diamond. These are search results with the Acr database as the query and <output_dir>/subjects/<organism_id>_protein.faa as the database.
<output_dir>/intermediates/<organism_id>_acr_homolog_result.fasta	Protein Sequence file (.faa) of protein in <output_dir>/intermediates/<organism_id>_acr_homolog_result.txt
<output_dir>/intermediates/<organism_id>_acr_diamond_database.dmnd	Database of diamond made from <output_dir>/subjects/<organism_id>_protein.faa file
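
After a run, the two final result files from the table above can be located programmatically. This is a minimal sketch under stated assumptions: the helper names `final_outputs` and `report` are hypothetical, and how `<organism_id>` is derived from the input file names is not specified here, so it must be supplied by the caller.

```python
from pathlib import Path
from typing import Dict


def final_outputs(output_dir: str, organism_id: str) -> Dict[str, Path]:
    """Build the paths of the main result files listed in the table above."""
    out = Path(output_dir)
    return {
        "guilt_by_association": out / f"{organism_id}_guilt-by-association.out",
        "homology_based": out / f"{organism_id}_homology_based.out",
        "blast_out": out / "intermediates" / "blast_out.txt",
    }


def report(output_dir: str, organism_id: str) -> None:
    """Print which of the expected result files exist on disk."""
    for label, path in final_outputs(output_dir, organism_id).items():
        status = "found" if path.is_file() else "missing"
        print(f"{label}: {path} ({status})")
```

A missing file does not necessarily indicate an error; for instance, some outputs may only be produced when a CRISPR-Cas system is found.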

IV. Docker Support

To help users configure the environment easily, we provide a Dockerfile. You can build the image with the following commands ([tag name] is the name of the tag; you can choose any tag name):

git clone https://github.com/haidyi/acrfinder.git
cd acrfinder
docker build -t [tag name] .

If you don't want to build the image yourself, AcrFinder is also available on Docker Hub. You can pull the AcrFinder image from Docker Hub directly using the command:

docker pull [OPTIONS] haidyi/acrfinder:latest

V. Examples

python3 acr_aca_cri_runner.py -n sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.fna -f sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.gff -a sample_organisms/GCF_000210795.2/GCF_000210795.2_protein.faa -o [output_dir] -z B -c 2 -p true -g true

Alternatively, you can provide only the .fna file as input:

python3 acr_aca_cri_runner.py -n sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.fna -o [output_dir] -z B -c 2 -p true -g true
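
When running AcrFinder over many genomes, it may be convenient to assemble the command line in Python. The sketch below uses only flags from the options table (-n, -f, -a, -o, -z); the function names `build_command` and `run_acrfinder` are hypothetical, and the rule that .gff and .faa must be given together is an assumption based on the Input section.

```python
import subprocess
from typing import List, Optional


def build_command(fna: str, out_dir: str, gff: Optional[str] = None,
                  faa: Optional[str] = None, genome_type: str = "B") -> List[str]:
    """Assemble the argument list for acr_aca_cri_runner.py."""
    cmd = ["python3", "acr_aca_cri_runner.py",
           "-n", fna, "-o", out_dir, "-z", genome_type]
    if gff and faa:
        cmd += ["-f", gff, "-a", faa]
    elif gff or faa:
        # Assumption: either all three files are given, or only the .fna file.
        raise ValueError("Provide both .gff and .faa, or neither.")
    return cmd


def run_acrfinder(*args, **kwargs) -> None:
    """Run AcrFinder from the repository directory; raises if it exits non-zero."""
    subprocess.run(build_command(*args, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.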

Run the container

First, make sure the Docker image has been pulled from Docker Hub or built locally. AcrFinder is located in the working directory of the container.

Interactive Usage

docker run [OPTIONS] [NAME:TAG] /bin/bash
python3 acr_aca_cri_runner.py -n sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.fna -f sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.gff -a sample_organisms/GCF_000210795.2/GCF_000210795.2_protein.faa -o [output dir] -z B -c 2 -p true -g true

Use own sequence

If you want to analyze your own sequence, you can use Docker's -v flag to mount a host directory into the container.

For example, if you want to run AcrFinder on GCF_000210795.2 (with the .fna, .gff and .faa files in the directory ~/GCF_000210795.2), you can use the command below:

docker run --rm -it -v ~/GCF_000210795.2:/app/acrfinder/GCF_000210795.2 haidyi/acrfinder:latest python3 acr_aca_cri_runner.py -n GCF_000210795.2/GCF_000210795.2_genomic.fna -f GCF_000210795.2/GCF_000210795.2_genomic.gff -a GCF_000210795.2/GCF_000210795.2_protein.faa -o GCF_000210795.2/output_dir -z B -c 2 -p true -g true

Then, you will see the output result in ~/GCF_000210795.2/output_dir.

For more information about how to use docker, you can refer to https://docs.docker.com.

VI. Workflow of AcrFinder

VII. FAQ

Q) I ran acr_aca_cri_runner.py and I got errors that pertain to CRISPR/Cas. What's the issue?

A) Make sure CRISPRCasFinder is installed properly. CRISPRCasFinder has many dependencies of its own and will only work if they are all installed correctly. A good indicator of a correctly installed CRISPRCasFinder is the following terminal output:

################################################################
# --> Welcome to dependencies/CRISPRCasFinder/CRISPRCasFinder.pl (version 4.2.17)
################################################################


vmatch2 is...............OK
mkvtree2 is...............OK
vsubseqselect2 is...............OK
fuzznuc (from emboss) is...............OK
needle (from emboss) is...............OK
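
If you capture this terminal output in a script, the per-tool status lines can be checked automatically. The following is a minimal sketch, assuming the lines keep the "<tool> is.....<status>" shape shown above; the function name `parse_dependency_checks` is hypothetical, and because the text printed for a failed check is not shown here, any status other than "OK" is simply treated as a failure.

```python
import re
from typing import Dict

# Matches lines such as "vmatch2 is...............OK".
CHECK_LINE = re.compile(r"^(?P<tool>.+?) is\.{3,}(?P<status>\S+)\s*$")


def parse_dependency_checks(text: str) -> Dict[str, bool]:
    """Map each tool named in the output to True if its status is 'OK'."""
    results: Dict[str, bool] = {}
    for line in text.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match:
            results[match.group("tool")] = match.group("status") == "OK"
    return results
```

An empty result would suggest that CRISPRCasFinder did not reach its dependency check at all, which is itself worth investigating.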
