AcrFinder (c) Yin Lab@UNL2019
I. Installation / Dependencies
Clone/download the repository. Some dependencies are included and can be found in the dependencies/ directory. The program expects these versions; using other versions can result in unexpected behavior.
CRISPRCasFinder
- Already in the dependencies/ directory. To use CRISPRCasFinder on your machine, make sure you run its install script. The manual can be found here. Running the install script will set up paths for all the dependencies of CRISPRCasFinder.
It is a common problem to forget to install CRISPRCasFinder, so ensure that CRISPRCasFinder runs properly before executing acr_aca_cri_runner.py to avoid errors.
blastn
- acr_aca_cri_runner.py will call/use blastn to search a genome. Install blastn from NCBI.
psiblast+
- Used with CDD to find mobilome proteins. Install from NCBI.
blastp
- Used with the prophage database to find prophage. Install blastp from NCBI.
python3
- For all scripts with .py extension. Use any version at or above 3.4.
PyGornism
- Already in dependencies/ directory. Used to parse organism files and generate organism files in certain formats.
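Before going further, it can help to confirm that the external tools listed above are reachable from your shell. The following is a minimal sketch, not part of AcrFinder: the executable names in REQUIRED_TOOLS are assumptions based on the tool names in this list, and the helper missing_tools is hypothetical.

```python
import shutil

# Executable names are assumptions based on the tools listed above;
# adjust them if your installation uses different names.
REQUIRED_TOOLS = ["blastn", "blastp", "psiblast", "makeblastdb", "python3"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

This only checks that the executables exist. It does not check versions, and it does not replace running the CRISPRCasFinder install script.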
After cloning the repository, there are three databases to install.
cd dependencies/prophage && makeblastdb -in prophage_virus.db -dbtype prot -out prophage
cd dependencies/ && tar -xzf cdd-mge.tar.gz && rm cdd-mge.tar.gz
mkdir -p dependencies/cdd
cd dependencies/cdd && wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cdd.tar.gz && tar -xzf cdd.tar.gz && rm cdd.tar.gz
makeprofiledb -title CDD.v.3.12 -in Cdd.pn -out Cdd -threshold 9.82 -scale 100.0 -dbtype rps -index true
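The commands above change directory with relative paths, so it matters where each one is started from. As a rough sketch, the two indexing steps can be written down together with the directory each one is meant to run in. The helper database_setup_steps is hypothetical, and running makeprofiledb inside dependencies/cdd is an assumption (that is where the unpacked cdd.tar.gz is expected to place Cdd.pn).

```python
from pathlib import Path


def database_setup_steps(repo_root):
    """Return (working_directory, argv) pairs for the two indexing commands."""
    deps = Path(repo_root) / "dependencies"
    return [
        (deps / "prophage",
         ["makeblastdb", "-in", "prophage_virus.db", "-dbtype", "prot",
          "-out", "prophage"]),
        # Assumption: makeprofiledb is run inside dependencies/cdd,
        # where the unpacked cdd.tar.gz is expected to place Cdd.pn.
        (deps / "cdd",
         ["makeprofiledb", "-title", "CDD.v.3.12", "-in", "Cdd.pn",
          "-out", "Cdd", "-threshold", "9.82", "-scale", "100.0",
          "-dbtype", "rps", "-index", "true"]),
    ]


if __name__ == "__main__":
    # Dry run: print each step. To execute one, pass it to
    # subprocess.run(argv, cwd=workdir, check=True).
    for workdir, argv in database_setup_steps("."):
        print(f"(in {workdir}) {' '.join(argv)}")
```

The download and unpacking steps (wget, tar) are left out of the sketch and still need to be run as shown above.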
AcrFinder is a tool used to identify Anti-CRISPR proteins (Acr) using both sequence homology and guilt-by-association approaches.
This README file contains information about only the python scripts found in the current directory. These are the scripts that are used to identify genomic loci that contain Acr and/or Aca homologs.
To find out how to use other dependencies look at online sources:
CRISPRCasFinder
- https://crisprcas.i2bc.paris-saclay.fr/CrisprCasFinder/Index
*CRISPRCasFinder is used to identify CRISPR-Cas systems. The results are then used to classify the genomic loci that contain Acr and/or Aca homologs. If no CRISPR-Cas systems are found within a genome, only the homology-based search for Acr homologs is run.
AcrFinder takes .fna, .gff and .faa files as input. Providing only the .fna file is also acceptable; in that case, the .gff and .faa files are generated by running Prodigal.
Option | Alternative | Purpose |
---|---|---|
-h | --help | Shows all available options |
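-n | --inFNA | Required. Path to fna file |
-f | --inGFF | Path to gff file to use/parse. Generated by Prodigal when only the fna file is given |
-a | --inFAA | Path to faa file to use/parse. Generated by Prodigal when only the fna file is given |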
-m | --aaThresh | Max size of a protein in order to be considered Aca/Acr (aa) {default = 200} [integer] |
-d | --distThresh | Max intergenic distance between proteins (bp) {default = 150} [integer] |
-r | --minProteins | Min number of proteins needed per locus {default = 2} [integer] |
-y | --arrayEvidence | Minimum evidence level needed of a CRISPR spacer to use {default = 3} [integer] |
-o | --outDir | Path to output directory to store results in. If the directory does not exist, the program will attempt to create it at the given path |
-t | --aca | Known Aca file (.faa) used with diamond to identify candidate Aca in candidate Acr-Aca loci |
-u | --acr | Known Acr file (.faa) used with diamond to search for Acr homologs |
-z | --genomeType | How to treat the genome. There are three options: Virus, Bacteria and Archaea. Virus will not run CRISPRCasFinder (Note: when Virus is selected, also set -c 0 so that no MGE search is run for the virus). Archaea will run CRISPRCasFinder with a special Archaea flag (-ArchaCas). Bacteria will run CRISPRCasFinder without the Archaea flag {default = V} [string] |
-e | --proteinUpDown | Number of surrounding (up- and down-stream) proteins to use when gathering a neighborhood {default = 10} [integer] |
-c | --minCDDProteins | Minimum number of proteins in neighborhood that must have a CDD mobilome hit so the Acr/Aca locus can be attributed to a CDD hit {default = 1} [integer] |
-g | --gi | Uses IslandViewer (GI) database. {default = false} [boolean] |
-p | --prophage | Uses PHASTER (prophage) database. {default = false} [boolean] |
-s | --strict | All proteins in locus must lie within a region found in DB(s) being used {default = false} [boolean] |
-l | --lax | Only one protein must lie within a region found in DB(s) being used {default = true} [boolean] |
--blsType | None | Which blast type to use when searching for mobile genetic elements (MGE). Possible choices: blastp or rpsblast {default = blastp} |
--identity | None | The --id (identity) parameter for the diamond search {default = 30} [integer] |
--coverage | None | The --query-cover parameter for the diamond search {default = 0.8} [float] |
--e_value | None | The -e (e-value) parameter for the diamond search {default = 0.01} [float] |
--blast_slack | None | How far an Acr/Aca locus is allowed to be from a blastn hit to be considered high confidence {default = 5000} |
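To make the interplay of -m, -d and -r more concrete, here is a minimal sketch of how the three thresholds could combine when grouping proteins into candidate loci. It is an illustration under stated assumptions, not AcrFinder's actual implementation: the Protein class and the candidate_loci function are hypothetical, and the real tool may apply further criteria (for example strand) that are not shown here. The sketch needs Python 3.7 or newer because it uses dataclasses.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    name: str
    start: int   # genomic start coordinate (bp)
    end: int     # genomic end coordinate (bp)
    length: int  # protein length (aa)


def candidate_loci(proteins, aa_thresh=200, dist_thresh=150, min_proteins=2):
    """Group consecutive short proteins into candidate loci."""
    loci, current = [], []
    for protein in sorted(proteins, key=lambda p: p.start):
        short = protein.length <= aa_thresh
        close = bool(current) and protein.start - current[-1].end <= dist_thresh
        if short and close:
            current.append(protein)
            continue
        # The run is broken: keep it if it is long enough, then start over.
        if len(current) >= min_proteins:
            loci.append(current)
        current = [protein] if short else []
    if len(current) >= min_proteins:
        loci.append(current)
    return loci


proteins = [
    Protein("p1", 100, 400, 100),
    Protein("p2", 450, 900, 150),
    Protein("p3", 2000, 2300, 100),
    Protein("p4", 2350, 3490, 380),
]
print([[p.name for p in locus] for locus in candidate_loci(proteins)])
```

With the default thresholds, p1 and p2 form one locus: both are at most 200 aa and lie 50 bp apart. p3 is too far from p2 to join them, and p4 is too long to form a locus with p3, so p3 alone falls below the minimum of two proteins.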
There are three levels of classification in output:
Classification | Meaning |
---|---|
Low Confidence | If this Acr-Aca locus has a CRISPR-Cas locus but no self-targeting spacers in the genome, it is labeled as “low confidence” and inferred to target the CRISPR-Cas locus. |
Medium Confidence | If this Acr-Aca locus has a self-targeting spacer target in the genome but not nearby, it is labeled as “medium confidence” and inferred to target the CRISPR-Cas locus with the self-targeting spacer. "Nearby" means within 5,000 bp. |
High Confidence | If this Acr-Aca locus has a nearby self-targeting spacer target, it is labeled as “high confidence” and inferred to target the CRISPR-Cas locus with the self-targeting spacer. |
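The three levels in the table can be read as a small decision rule. The sketch below is one possible reading, not the tool's own code: classify_locus and its parameters are hypothetical, target_positions is assumed to be a list of genomic coordinates of self-targeting spacer targets, and "nearby" is assumed to be measured from the locus boundaries using the --blast_slack default of 5000.

```python
def classify_locus(locus_start, locus_end, has_crispr_cas, target_positions,
                   slack=5000):
    """Return the confidence label for one Acr-Aca locus, or None."""
    if not has_crispr_cas:
        # Without a CRISPR-Cas system only the homology-based search applies.
        return None
    if not target_positions:
        return "Low Confidence"
    nearby = any(locus_start - slack <= pos <= locus_end + slack
                 for pos in target_positions)
    return "High Confidence" if nearby else "Medium Confidence"
```

For a locus spanning 10,000 to 11,000 bp, a target at 14,000 bp would count as nearby under these assumptions, while a target at 30,000 bp would not.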
Name | Meaning |
---|---|
<output_dir>/CRISPRCas_OUTPUT | The output folder of CRISPRCasFinder |
<output_dir>/subjects | The folder that contains the input files |
<output_dir>/intermediates | The folder that contains intermediate result files |
<output_dir>/intermediates/blast_out.txt | Results from blast+ |
<output_dir>/<organism_id>_guilt-by-association.out | The final set of Acr/Aca regions that passed the initial filters as well as the CDD mobilome and prophage/gi filters. |
<output_dir>/<organism_id>_homology_based.out | The final set of proteins that have similarity to proteins in the Acr database under given similarity threshold. |
<output_dir>/intermediates/masked_db/ | Directory containing the database (fna with CRISPR array regions masked) used for the blastn search for self-targeting spacer matches |
<output_dir>/intermediates/spacers_with_desired_evidence.fna | File containing the CRISPR spacers extracted from the CRISPRCasFinder results that have the desired evidence level. Used as the query for the blastn search. |
<output_dir>/intermediates/<organism_id>_candidate_acr_aca.txt | Potential Acr/Aca regions that passed initial filters. |
<output_dir>/intermediates/<organism_id>_candidate_acr_aca.faa | Potential Acr/Aca regions in an faa format. |
<output_dir>/intermediates/<organism_id>_candidate_acr_aca_neighborhood.faa | An extension of the previous file that also includes the neighboring proteins of the potential Acr/Aca. Used as the query for the blastp search against the prophage database. |
<output_dir>/intermediates/<organism_id>_candidate_acr_aca_{blastp/rpsblast}_results.txt | Result file from blastp against the prophage database or rpsblast against the cdd-mge database. |
<output_dir>/intermediates/<organism_id>_candidate_acr_aca_diamond_result.txt | Results of diamond. These are search results with the Aca database as the query and <output_dir>/intermediates/<organism_id>_candidate_acr_aca.faa as the database. |
<output_dir>/intermediates/<organism_id>_candidate_acr_homolog_result.txt | Results of diamond. These are search results with the Acr database as the query and <output_dir>/intermediates/<organism_id>_candidate_acr_aca.faa as the database. |
<output_dir>/intermediates/<organism_id>_candidate_acr_aca_diamond_database.dmnd | Database of diamond made from <organism_id>_candidate_acr_aca.faa file. |
<output_dir>/intermediates/<organism_id>_acr_homolog_result.txt | Results of diamond. These are search results with the Acr database as the query and <output_dir>/subjects/<organism_id>_protein.faa as the database. |
<output_dir>/intermediates/<organism_id>_acr_homolog_result.fasta | Protein Sequence file (.faa) of protein in <output_dir>/intermediates/<organism_id>_acr_homolog_result.txt |
<output_dir>/intermediates/<organism_id>_acr_diamond_database.dmnd | Database of diamond made from <output_dir>/subjects/<organism_id>_protein.faa file |
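After a run, the two final result files from the table above are usually the first thing to look at. The sketch below builds their paths and reports whether each one exists and is non-empty. The helpers final_result_files and report are hypothetical, and the value of organism_id for a given input is something you need to check against your own output directory. The format of the .out files is not described here, so the sketch does not try to parse them.

```python
from pathlib import Path


def final_result_files(output_dir, organism_id):
    """Return the two final result paths named in the table above."""
    out = Path(output_dir)
    return {
        "guilt_by_association": out / f"{organism_id}_guilt-by-association.out",
        "homology_based": out / f"{organism_id}_homology_based.out",
    }


def report(output_dir, organism_id):
    """Map each result name to True if its file exists and is non-empty."""
    return {name: path.is_file() and path.stat().st_size > 0
            for name, path in final_result_files(output_dir, organism_id).items()}
```

A missing or empty file does not necessarily mean the run failed; it may simply mean that nothing passed the filters for that search.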
To help users configure the environment easily, we provide a Dockerfile. Build the image with the commands below ([tag name] is the name of the tag; you can set any tag name):
git clone https://github.com/haidyi/acrfinder.git
cd acrfinder
docker build -t [tag name] .
If you don't want to build the image yourself, AcrFinder is also available on Docker Hub. You can pull the AcrFinder image directly using the command:
docker pull [OPTIONS] haidyi/acrfinder:latest
python3 acr_aca_cri_runner.py -n sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.fna -f sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.gff -a sample_organisms/GCF_000210795.2/GCF_000210795.2_protein.faa -o [output_dir] -z B -c 2 -p true -g true
Alternatively, you can use only the .fna file as input:
python3 acr_aca_cri_runner.py -n sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.fna -o [output_dir] -z B -c 2 -p true -g true
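When running many genomes, it can be convenient to assemble these commands in a script. The following is a small sketch, assuming it is run from the repository root: build_runner_command is a hypothetical helper, and it only passes through single-letter options that you supply as keyword arguments. It uses f-strings, so it needs Python 3.6 or newer.

```python
def build_runner_command(fna, out_dir, gff=None, faa=None, **options):
    """Assemble an argv list for acr_aca_cri_runner.py."""
    argv = ["python3", "acr_aca_cri_runner.py", "-n", fna]
    # -f and -a are passed only when both annotation files are available;
    # otherwise the .fna-only form is used.
    if gff and faa:
        argv += ["-f", gff, "-a", faa]
    argv += ["-o", out_dir]
    for flag, value in options.items():
        argv += [f"-{flag}", str(value)]
    return argv


print(build_runner_command("genome.fna", "results", z="B", c=2, p="true", g="true"))
```

The resulting list can be handed to subprocess.run(argv, check=True). Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.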
First, make sure the docker image has been pulled from Docker Hub or built locally. AcrFinder is located in the working directory of the container.
docker run [OPTIONS] [NAME:TAG] /bin/bash
python3 acr_aca_cri_runner.py -n sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.fna -f sample_organisms/GCF_000210795.2/GCF_000210795.2_genomic.gff -a sample_organisms/GCF_000210795.2/GCF_000210795.2_protein.faa -o [output_dir] -z B -c 2 -p true -g true
If you want to analyze your own sequences, use docker's -v flag to mount a host directory into the container. For example, to run AcrFinder on GCF_000210795.2 (with the .fna, .gff and .faa files in the directory ~/GCF_000210795.2), you can use the command below:
docker run --rm -it -v ~/GCF_000210795.2:/app/acrfinder/GCF_000210795.2 haidyi/acrfinder:latest python3 acr_aca_cri_runner.py -n GCF_000210795.2/GCF_000210795.2_genomic.fna -f GCF_000210795.2/GCF_000210795.2_genomic.gff -a GCF_000210795.2/GCF_000210795.2_protein.faa -o GCF_000210795.2/output_dir -z B -c 2 -p true -g true
Then, you will see the output result in ~/GCF_000210795.2/output_dir.
For more information about how to use docker, you can refer to https://docs.docker.com.
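The mount in the example above maps a host directory to a directory of the same name under /app/acrfinder, so the relative paths passed to the runner stay the same inside the container. A minimal sketch of that mapping is shown below. The helper docker_command is hypothetical; the image name and the /app/acrfinder path are taken from the example command, and the sketch assumes a POSIX host path.

```python
from pathlib import Path, PurePosixPath


def docker_command(host_dir, runner_args, image="haidyi/acrfinder:latest",
                   workdir="/app/acrfinder"):
    """Wrap runner arguments in a `docker run` call that mounts host_dir."""
    host = Path(host_dir).expanduser()
    # Mount the host directory under the container's work directory,
    # keeping the directory name so relative paths stay the same.
    mount = f"{host}:{PurePosixPath(workdir) / host.name}"
    return ["docker", "run", "--rm", "-it", "-v", mount, image,
            "python3", "acr_aca_cri_runner.py", *runner_args]
```

Note that -it expects an interactive terminal. When the command is launched from a script without one, docker may refuse to start, in which case dropping -it is the usual workaround.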
Q) I ran acr_aca_cri_runner.py and I got errors that pertain to CRISPR/Cas. What's the issue?
A) Make sure CRISPRCasFinder is installed properly. CRISPRCasFinder has many dependencies of its own and will only work if they are all installed correctly. A good indicator of a correctly installed CRISPRCasFinder is the following terminal output:
################################################################
# --> Welcome to dependencies/CRISPRCasFinder/CRISPRCasFinder.pl (version 4.2.17)
################################################################
vmatch2 is...............OK
mkvtree2 is...............OK
vsubseqselect2 is...............OK
fuzznuc (from emboss) is...............OK
needle (from emboss) is...............OK