DANGER Analysis

Risk-averse on/off-target assessment for CRISPR editing without a reference genome.

DANGER (Deleterious and ANticipatable Guides Evaluated by RNA-sequencing) Analysis is a bioinformatics pipeline that elucidates genomic on/off-target sites in mRNA-transcribed regions related to expression changes and then quantifies phenotypic risk at the gene ontology (GO) term level using RNA-seq data.

DANGER Analysis is helpful for people who want to:

screen off-target sites in mRNA-transcribed regions without a reference genome.
find off-target sites that result in decreased expression using RNA-seq data.
annotate off-target genes with GO terms.
get a personal transcriptome-based on/off-target profile.
provide a transcriptome-aware on/off-target profile without a reference genome.
search for potential on-target sites in expressed genes without a reference genome.
quantify gene expression at on/off-target sites.

If you have a question or request, don't hesitate to contact me.

Kazuki Nakamae, Ph.D.

kazukinakamae[at mark]gmail.com

References

[1] Nakamae and Bono, DANGER analysis: risk-averse on/off-target assessment for CRISPR editing without a reference genome. Bioinformatics Advances, vbad114. 2023 https://doi.org/10.1093/bioadv/vbad114

[2] Nakamae and Bono, DANGER analysis: Risk-averse on/off-target assessment for CRISPR editing without a reference genome. bioRxiv. 2023 https://doi.org/10.1101/2023.03.11.531115

Run DANGER Analysis using Docker images (Recommended)

Please prepare the following files in your current directory:

"fltr_lowexpr_dj1_trinity_out_dir.Trinity.fasta": de novo transcriptome assembly (without redundancy)
"guide_pam.fa": The binding sequence of protospacer & PAM (e.g. GCCGGTTCAGTGCAGCCGTGAGG)
Each "RSEM.isoforms.results": the expression profiles exported from Trinity:align_and_estimate_abundance.pl
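
As a minimal sketch, a single-record FASTA file such as "guide_pam.fa" could be produced with a few lines of Python. The helper name format_guide_fasta and the record name "guide" are illustrative assumptions and are not part of DANGER Analysis; the sequence is the example given in the list above.

```python
def format_guide_fasta(sequence: str, name: str = "guide") -> str:
    """Format one protospacer+PAM sequence as a single-record FASTA string."""
    seq = sequence.strip().upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    return f">{name}\n{seq}\n"


if __name__ == "__main__":
    # 20-nt protospacer followed by a 3-nt PAM (example sequence from this README).
    record = format_guide_fasta("GCCGGTTCAGTGCAGCCGTGAGG")
    with open("guide_pam.fa", "w") as handle:
        handle.write(record)
```

The file holds exactly one record because the pipeline expects a single on-target sequence.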

The example dataset was deposited in SourceForge.

Step1: Summarize expression profiles

Run collect_exp_data.py in the Docker image.

sudo docker run --name example_collectexp --memory 10g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangeranalysis:1.1 python /tmp/collect_exp_data.py \
-o <Output directory> \
-w <"RSEM.isoforms.results" in the expression profiles of WT samples> \
-e <"RSEM.isoforms.results" in the expression profiles of Edited samples> \
-t <Threshold for Edited/WT ratio>;

(EXAMPLE)

sudo docker run --name example_collectexp --memory 10g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangeranalysis:1.1 python /tmp/collect_exp_data.py \
-o exp_collection \
-w rmrna_dj1_ctrl_rep1/RSEM.isoforms.results rmrna_dj1_ctrl_rep2/RSEM.isoforms.results rmrna_dj1_ctrl_rep3/RSEM.isoforms.results \
-e rmrna_dj1_ko_rep1/RSEM.isoforms.results rmrna_dj1_ko_rep2/RSEM.isoforms.results rmrna_dj1_ko_rep3/RSEM.isoforms.results \
-t 2.5; # TPM ratio < 1/2.5 = 0.4

The output:

ctrl_edited_fltrexpr_contig_expected_count_onratio.csv: Comma-separated text file, including read count values of each sample, Edited/WT ratio, and expression labels
ctrl_edited_fltrexpr_contig_fpkm_onratio.csv: Comma-separated text file, including FPKM values of each sample, Edited/WT ratio, and expression labels
ctrl_edited_fltrexpr_contig_tpm_onratio.csv: Comma-separated text file, including TPM values of each sample, Edited/WT ratio, and expression labels

collect_exp_data.py options

-o:     Output directory. The name should be unique.
-w:     "RSEM.isoforms.results" in the expression profiles of WT samples.
-e:     "RSEM.isoforms.results" in the expression profiles of Edited samples.
-t:     Threshold for Edited/WT ratio. Edited/WT > (Threshold) is "upregulated." Edited/WT < 1/(Threshold) is "downregulated."
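
The threshold rule for -t can be sketched in Python as follows. This is only an illustration of the rule as stated above: the function name label_by_ratio, the use of the mean across replicates, and the "undefined" label for a zero WT mean are assumptions, and collect_exp_data.py may compute the ratio differently.

```python
from statistics import mean


def label_by_ratio(wt_values, edited_values, threshold):
    """Label a contig from the Edited/WT ratio of mean expression values."""
    wt_mean = mean(wt_values)
    edited_mean = mean(edited_values)
    if wt_mean == 0:
        # The ratio is undefined here; how the real script handles this is not known.
        return "undefined"
    ratio = edited_mean / wt_mean
    if ratio > threshold:
        return "upregulated"
    if ratio < 1 / threshold:
        return "downregulated"
    return "unchanged"
```

With a threshold of 2.5, a contig whose Edited/WT ratio falls below 0.4 is labeled "downregulated", which matches the comment in the example command.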

(Optional): Use expression profiles with DEG

DANGER Analysis can use a DEG profile instead of the aforementioned expression profiles.

A DEG profile can be created by following the example below.

DEG analysis of de novo transcriptome assembly from Trinity

The DEG profile can be adapted to the input format of DANGER Analysis by processing it as follows.

cat DEG_profile.csv | tr -d '"' > DEG_profile.csv_clean.csv;
echo 'id,name,<ctrl sample 1>,<ctrl sample 2>,...,<ctrl sample n>,<edited sample 1>,<edited sample 2>,...,<edited sample n>,gene_id,a.value,m.value,p.value,q.value,rank,estimatedDEG,Exp' > DEG_data_header.csv
cp DEG_data_header.csv DEG_profile_fixed.csv
tail -n +2 DEG_profile.csv_clean.csv >> DEG_profile_fixed.csv
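### Label rows by p.value < <alpha> and the sign of m.value.
### $13 and $14 must point at the m.value and p.value columns; these positions hold for four control and four edited samples and shift with the number of sample columns.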
awk -F, 'BEGIN{OFS=","} {if (NR==1)print $0;else {if($14 < <alpha> && $13 < 0) print $0, "downregulated";else if($14 < <alpha> && $13 > 0) print $0, "upregulated"; else print $0, "unchanged"}}' \
DEG_profile_fixed.csv \
> complete_DEG_profile.csv;
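
The same labeling can be sketched in Python by looking columns up by name, which avoids depending on column positions. The function name label_deg_rows is hypothetical; the column names p.value, m.value, and Exp are taken from the header line above, and the sketch assumes both value columns are numeric in every row.

```python
import csv
import io


def label_deg_rows(csv_text, alpha):
    """Set an 'Exp' label on each row from p.value and the sign of m.value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    labeled = []
    for row in reader:
        p_value = float(row["p.value"])
        m_value = float(row["m.value"])
        if p_value < alpha and m_value < 0:
            row["Exp"] = "downregulated"
        elif p_value < alpha and m_value > 0:
            row["Exp"] = "upregulated"
        else:
            row["Exp"] = "unchanged"
        labeled.append(row)
    return labeled
```

The rows can then be written back with csv.DictWriter if a file in the same layout as complete_DEG_profile.csv is needed.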

Step2: GO Annotation & D-index Calculation

Run dangeranalysis_v1.sh in the Docker image.

sudo docker run --name example_danalysis --memory 100g --rm -v `pwd`:/DATA -w /tmp -i kazukinakamae/dangeranalysis:1.1 bash /tmp/dangeranalysis_v1.sh \
<Database for GO annotation> \
<Database Type> \
<Output directory> \
<expression profile> \
<de novo transcriptome assembly (without redundancy)> \
<binding sequence of protospacer & PAM> \
<mismatch number for off-target search> \
<PAM for off-target search>;

(EXAMPLE)

sudo docker run --name example_danalysis --memory 100g --rm -v `pwd`:/DATA -w /tmp -i kazukinakamae/dangeranalysis:1.1 bash /tmp/dangeranalysis_v1.sh \
Dr \
pep \
output \
exp_collection/ctrl_edited_fltrexpr_contig_tpm_onratio.csv \
fltr_lowexpr_dj1_trinity_out_dir.Trinity.fasta \
guide_pam.fa \
11 \
NGG;

dangeranalysis_v1.sh options

Database for GO annotation:    Available database for GO annotation
Users can choose here:
- Hs : Human
- Mm : Mouse
- Dm : Fly
- Ce : C.elegans
- Dr : Zebrafish
- Ag : Mosquito
- At : A.thaliana
- Bt : Cow
- Cf : Dog
- EcK12 : E.coli K-12
- EcSakai : E.coli O157:H7 str. Sakai
- Gg : Chicken
- Mmu : Monkey
- Pt : Chimpanzee
- Rn : Rat
- Sc : Yeast
- Ss : Pig
- Xl : Frog
- Mxanthus : M.xanthus

Database Type:     Sequence used for gene annotation
- pep : Protein sequence
- cdna : mRNA sequence

Output directory:     Output directory. The name should be unique.

expression profile (summary of the TPM):     Comma-separated text file, including TPM values of each sample, Edited/WT ratio, and expression labels

de novo transcriptome assembly (without redundancy):     de novo transcriptome assembly from Trinity

binding sequence of protospacer & PAM:     Fasta file including single sequence of on-target site

mismatch number for off-target search:     mismatch number used in CrisFlash. It can be 1-11 nt.

PAM for off-target search:   PAM sequence used in CrisFlash. SpCas9 is generally NGG. If you want to consider minor PAM in SpCas9, you can choose NRR.
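
To illustrate what the mismatch number and the PAM pattern mean, the following Python sketch checks one candidate site against both criteria. It is not how CrisFlash is implemented; the function names are hypothetical, and only the IUPAC codes needed for NGG and NRR (N = any base, R = A or G) are included.

```python
# IUPAC nucleotide codes needed for the PAM patterns mentioned above.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "N": "ACGT"}


def pam_matches(site_pam: str, pam_pattern: str) -> bool:
    """Return True if a concrete PAM (e.g. 'AGG') fits a pattern (e.g. 'NGG')."""
    if len(site_pam) != len(pam_pattern):
        return False
    return all(base in IUPAC[code] for base, code in zip(site_pam, pam_pattern))


def count_mismatches(guide: str, site: str) -> int:
    """Count positions where two equal-length sequences differ."""
    if len(guide) != len(site):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(guide, site) if a != b)


def is_candidate_site(guide_pam: str, site: str, pam_pattern: str, max_mismatches: int) -> bool:
    """Check a site of the same length as protospacer+PAM against both criteria."""
    pam_len = len(pam_pattern)
    protospacer = guide_pam[:-pam_len]
    site_body, site_pam = site[:-pam_len], site[-pam_len:]
    return (pam_matches(site_pam, pam_pattern)
            and count_mismatches(protospacer, site_body) <= max_mismatches)
```

Under this reading, NRR accepts every PAM that NGG accepts plus PAMs such as GAA, so choosing NRR can only widen the set of candidate sites.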

The result is saved in DANGER_analysis_result.

The table of D-indices will be saved as "DANGER_index_on_***.txt".

Troubleshooting

If you are a non-root user, the command can fail to download databases. We recommend that the root user runs the command.

Step3: D-index Evaluation

In DANGER Analysis, it is possible to use permutation data to evaluate the significance of the calculated D-index. The procedure for this is described below.

First, create expression permutation data. If you want to create 100 instances, input as follows:

mkdir exp_p_data;
sudo docker run --name generate_exp_p_data --memory 20g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangertest:1.0 \
bash /tmp/get_multi_exp_permutation_data.sh \
<original summary of TPM> \
<First seed number> \
<Last seed number> \
exp_p_data/exp;

(EXAMPLE)

mkdir example_exp_p_data;
sudo docker run --name generate_exp_p_data --memory 10g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangertest:1.0 \
bash /tmp/get_multi_exp_permutation_data.sh \
exp_collection/ctrl_edited_fltrexpr_contig_tpm_onratio.csv \
1 \
100 \
example_exp_p_data/exp;

The output is written as one directory per seed, with the seed value appended to the prefix (e.g. park7_2_5_exp_p_data/exp1, park7_2_5_exp_p_data/exp2, ...).
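
The role of the seed can be sketched in Python. This assumes that a permutation means shuffling the expression labels across contigs, which is one plausible reading; get_multi_exp_permutation_data.sh may work differently, and the function name permute_labels is hypothetical.

```python
import random


def permute_labels(labels, seed):
    """Return a shuffled copy of the labels, reproducible for a given seed."""
    rng = random.Random(seed)  # a private generator keeps each seed independent
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    labels = ["upregulated", "downregulated", "unchanged", "unchanged"]
    for seed in range(1, 4):
        print(seed, permute_labels(labels, seed))
```

Reusing a seed reproduces the same permutation, which is why each instance gets its own seed value.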

Second, create off-target permutation data. If you want to create 100 instances, input as follows:

mkdir mm_p_data;
sudo docker run --name generate_mm_p_data --memory 20g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangertest:1.0 \
bash /tmp/get_multi_mm_permutation_data.sh \
<Original off-target profile> \
<First seed number> \
<Last seed number> \
mm_p_data/mm;

(EXAMPLE)

mkdir example_mm_p_data;
sudo docker run --name generate_mm_p_data --memory 10g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangertest:1.0 \
bash /tmp/get_multi_mm_permutation_data.sh \
output/Result_offtarget_all.cas-offinder \
1 \
100 \
example_mm_p_data/mm;

The output is written as one directory per seed, with the seed value appended to the prefix (e.g. park7_mm8_NGG_data/mm1, park7_mm8_NGG_data/mm2, ...).

Next, calculate the pseudo D-index based on the permutation data.

mkdir p_result;
for seed in $(seq <First seed number> <Last seed number>)
do
        sudo docker run \
                --name example_dangeranalysis_false_p${seed} --memory 20g --rm -v `pwd`:/DATA -w /tmp -i kazukinakamae/dangertest:1.0 bash /tmp/dangertest_v2.sh \
                <Database for GO annotation> \
                <Database Type> \
                p_result/p${seed} \
                exp_p_data/exp_p${seed}/ctrl_edited_fltrexpr_contig_tpm_onratio.csv \
                mm_p_data/mm_p${seed}/Result_offtarget_all.cas-offinder \
                <Original directory of DANGER analysis> \
                <binding sequence of protospacer & PAM>;
done

(EXAMPLE)

mkdir false_output;
for seed in $(seq 1 100)
do
  sudo docker run \
          --name example_dangeranalysis_false_p${seed} --memory 20g --rm -v `pwd`:/DATA -w /tmp -i kazukinakamae/dangertest:1.0 bash /tmp/dangertest_v2.sh \
          Dr \
          pep \
          false_output/p${seed} \
          example_exp_p_data/exp_p${seed}/ctrl_edited_fltrexpr_contig_tpm_onratio.csv \
          example_mm_p_data/mm_p${seed}/Result_offtarget_all.cas-offinder \
          output \
          guide_pam.fa;
done

D-indices that exceed the upper bound of the specified confidence interval, taken from the t-distribution calculated from the pseudo D-indices, are considered significant:

sudo docker run \
--name example_dindextest --memory 20g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangertest:1.0 \
python /tmp/dindex_test.py \
-t <Original directory of DANGER analysis> \
-n <directory of pseudo DANGER analysis> \
-c <Confidence interval> \
-o <Output directory>;

(EXAMPLE)

sudo docker run \
--name example_dindextest --memory 20g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangertest:1.0 \
python /tmp/dindex_test.py \
-t output \
-n false_output \
-c 0.999999999999999 \
-o example_dindex_test_n100_ci99_9999999999999percent;
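
The significance criterion can be sketched in Python under one possible reading: the pseudo D-indices define a t-distribution located at their mean and scaled by their sample standard deviation, and a D-index is significant when it exceeds the upper bound of the interval. The function names are hypothetical and dindex_test.py may compute the bound differently. The critical value t_critical is supplied by the caller, for example from a t-table or from scipy.stats.t.ppf((1 + confidence) / 2, df=n - 1).

```python
from statistics import mean, stdev


def upper_bound(pseudo_d_indices, t_critical):
    """Upper bound of the interval: mean + t_critical * sample standard deviation."""
    return mean(pseudo_d_indices) + t_critical * stdev(pseudo_d_indices)


def is_significant(d_index, pseudo_d_indices, t_critical):
    """A D-index counts as significant when it exceeds the upper bound."""
    return d_index > upper_bound(pseudo_d_indices, t_critical)
```

A confidence level very close to 1, as in the example command, corresponds to a large critical value and therefore a strict cutoff.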

The table of significant D-indices will be saved as "Significant_DANGER_index_on_***.txt".

The table will be visualized and saved as "Significant_DANGER_index_on_***.tiff".

(Optional) Step4: False positive test of D-index Evaluation

In the D-index evaluation up to Step 3, parameters such as the number of mismatches, the TPM threshold, the PAM sequence, the confidence interval, and the permutation data are chosen at the user's discretion. To check how these choices behave, the false positive rate can be measured from the results. The procedure is described below.

First, create expression permutation data for the test. This permutation data must be created in a separate directory from the permutation data created in Step 3, and the seed values must not overlap with those used in Step 3. If you want to create 10 instances, input as follows:

mkdir falsepos_exp_p_data;
sudo docker run \
            --name falsepos_park7_2_5_exp_p_data --memory 20g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangertest:1.0 \
            bash /tmp/get_multi_exp_permutation_data.sh \
            <original expression profile> \
            <First seed number> \
            <Last seed number> \
            falsepos_exp_p_data/<Output prefix>;

(EXAMPLE)

mkdir falsepos_example_exp_p_data;
sudo docker run --name falsepos_example_exp_p_data --memory 20g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangertest:1.0 \
bash /tmp/get_multi_exp_permutation_data.sh \
exp_collection/ctrl_edited_fltrexpr_contig_tpm_onratio.csv \
1001 \
1010 \
falsepos_example_exp_p_data/exp;

Second, create off-target permutation data for the test. As before, use a separate directory from the permutation data created in Step 3 and seed values that do not overlap with those used in Step 3. If you want to create 10 instances, input as follows:

mkdir falsepos_mm_p_data;
sudo docker run \
            --name falsepos_park7_mm8_NGG_data --memory 20g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangertest:1.0 \
            bash /tmp/get_multi_mm_permutation_data.sh \
            <Original off-target profile> \
            <First seed number> \
            <Last seed number> \
            falsepos_mm_p_data/mm;

(EXAMPLE)

mkdir falsepos_example_mm_p_data;
sudo docker run --name falsepos_example_mm_p_data --memory 20g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangertest:1.0 \
bash /tmp/get_multi_mm_permutation_data.sh \
output/Result_offtarget_all.cas-offinder \
1001 \
1010 \
falsepos_example_mm_p_data/mm;
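
The seed constraint in this step can be checked before running the commands. The sketch below is a hypothetical helper, not part of DANGER Analysis; it treats each seed range as inclusive, as seq does.

```python
def seeds_overlap(first_a, last_a, first_b, last_b):
    """True if the inclusive seed ranges [first_a, last_a] and [first_b, last_b] share a seed."""
    return first_a <= last_b and first_b <= last_a


if __name__ == "__main__":
    # Step 3 example seeds (1-100) against Step 4 example seeds (1001-1010).
    print(seeds_overlap(1, 100, 1001, 1010))
```

For the example seeds the ranges are disjoint, so the test permutations are independent of those used to build the reference distribution.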

Next, calculate the pseudo D-index based on the above permutation data. The parameters should be the same as in Step 3.

mkdir falsepos_p_result;
for seed in $(seq <First seed number> <Last seed number>)
do
        sudo docker run \
                --name falsepos_park7_2_5_random_mm8_NGG_dangeranalysis_p${seed} --memory 20g --rm -v `pwd`:/DATA -w /tmp -i kazukinakamae/dangertest:1.0 bash /tmp/dangertest_v2.sh \
                <Database for GO annotation> \
                <Database Type> \
                falsepos_p_result/p${seed} \
                falsepos_exp_p_data/exp_p${seed}/ctrl_edited_fltrexpr_contig_tpm_onratio.csv \
                falsepos_mm_p_data/mm_p${seed}/Result_offtarget_all.cas-offinder \
                <Original directory of DANGER analysis> \
                <binding sequence of protospacer & PAM>;
done

(EXAMPLE)

mkdir falsepos_outputs;
for seed in $(seq 1001 1010)
do
        sudo docker run \
                --name falsepos_park7_2_5_random_mm8_NGG_dangeranalysis_p${seed} --memory 20g --rm -v `pwd`:/DATA -w /tmp -i kazukinakamae/dangertest:1.0 bash /tmp/dangertest_v2.sh \
                Dr \
                pep \
                falsepos_outputs/p${seed} \
                falsepos_example_exp_p_data/exp_p${seed}/ctrl_edited_fltrexpr_contig_tpm_onratio.csv \
                falsepos_example_mm_p_data/mm_p${seed}/Result_offtarget_all.cas-offinder \
                output \
                guide_pam.fa;
done

The pseudo D-indices that exceed the upper bound of the specified confidence interval, taken from the t-distribution calculated from the pseudo D-indices of Step 3, are considered significant. The parameters should be the same as in Step 3.

mkdir <directory of D-index test using pseudo DANGER analysis>;
for seed in $(seq <First seed number> <Last seed number>)
do
        sudo docker run \
                --name dangeranalysis_dindextest_${seed} --memory 20g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangertest:1.0 \
                python /tmp/dindex_test.py \
                -t falsepos_p_result/p${seed} \
                -n <directory of pseudo DANGER analysis> \
                -c <Confidence interval> \
                -o <directory of D-index test using pseudo DANGER analysis>/<user-defined label>_${seed};
done

(EXAMPLE)

mkdir falsepos_Dindex_test_n100;
for seed in $(seq 1001 1010)
do
        sudo docker run \
                --name dangeranalysis_dindextest_${seed} --memory 20g --rm -v `pwd`:/DATA -w /DATA -i kazukinakamae/dangertest:1.0 \
                python /tmp/dindex_test.py \
                -t falsepos_outputs/p${seed} \
                -n false_output \
                -c 0.999999999999999 \
                -o falsepos_Dindex_test_n100/99_9999999999999percent_seed_${seed};
done

The false positive rate is the number of D-indices listed in 'Significant_DANGER_index_on_XXX.txt' divided by the number listed in 'All_DANGER_index_on_XXX.txt'.
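
As a minimal sketch, the rate can be computed once the two counts have been read from the files. The function names are hypothetical, the counts are assumed to be obtained by the user from the two tables, and averaging the rate over the test seeds is an assumption rather than something the pipeline prescribes.

```python
def false_positive_rate(significant_count, total_count):
    """Fraction of D-indices that were called significant in one test run."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return significant_count / total_count


def mean_false_positive_rate(counts_per_seed):
    """Average the per-seed rates; counts_per_seed holds (significant, total) pairs."""
    rates = [false_positive_rate(sig, total) for sig, total in counts_per_seed]
    return sum(rates) / len(rates)
```

Because the test data are permutations, any D-index called significant here is a false positive by construction.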

Run DANGER Analysis in your local environment using conda (Deprecated)

Installation of DANGER Analysis

DANGER Analysis consists of Python and R scripts together with various bioinformatics tools. All processes run under Anaconda and Docker environments.

1. Installation of Docker

on MacOSX

Download and install Docker Desktop: https://docs.docker.com/engine/install/#desktop
Open the Docker settings menu to adjust the memory allocation (≥64GB of memory is recommended)