GDBr (pronounced Genome Debugger) is a tool that identifies Double-strand Break Repair (DSBR) from genomes and variants. GDBr identifies DSBR in three steps. The first step preprocesses the genomes with RagTag and svim-asm and makes sure the query chromosome names match the reference. The second step corrects the variants with BLAST, filters out variants that contain repeats with TRF and RepeatMasker, and saves the result as a CSV file. The last step sorts the corrected variants into the appropriate DSBR types.
You only need a reference sequence file and query sequence files to use GDBr.
We strongly recommend installing GDBr with the conda package manager.
conda create -n GDBr -c conda-forge -c bioconda -c chemical118 gdbr
conda activate GDBr
gdbr --version
You can also use the mamba package manager to install GDBr more quickly.
mamba create -n GDBr -c conda-forge -c bioconda -c chemical118 gdbr
mamba activate GDBr
gdbr --version
gdbr analysis -r <reference.fa> -q <query1.fa query2.fa ...> -s <species of data> -t <number of threads>
The above command runs the following three steps in a single invocation. If you want to redo an individual step, you can run the corresponding command below manually.
Using RagTag and svim-asm, GDBr preprocesses the data and returns a properly scaffolded query `.fa` sequence file and a variant `.vcf` file.
gdbr preprocess -r <reference.fa> -q <query1.fa query2.fa ...> -o prepro -t <number of threads>
The preprocess step relies on a sorting program for scaffolding and variant calling, and that program does not use all of its threads even when many are allocated. One way to optimize this is to process multiple queries at once, each with a small number of threads. However, this approach requires very high memory usage, so GDBr lets the user choose whether to use this optimization through the `--low_memory` option.
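Since the goal of preprocessing is a query whose chromosome names match the reference, a quick sanity check on the FASTA headers can be useful before moving on. The following is a minimal sketch, not part of GDBr: the function name `names_missing_from_reference` is hypothetical, and it assumes the sequence name is the text between `>` and the first whitespace on each header line.

```shell
# Hypothetical helper (not part of GDBr): list sequence names that appear in
# the query FASTA ($2) but not in the reference FASTA ($1).
names_missing_from_reference() {
  awk '
    /^>/ {
      name = substr($1, 2)                 # drop ">"; $1 ends at the first whitespace
      if (FNR == NR) ref[name] = 1         # first file: remember reference names
      else if (!(name in ref)) print name  # second file: report unknown names
    }
  ' "$1" "$2"
}
```

Called as `names_missing_from_reference <reference.fa> <preprocessed query.fa>`, it prints nothing when every query name also exists in the reference. Any names it does print are worth inspecting, although whether they indicate a problem depends on your data.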
Using BLAST, GDBr corrects the variant file so that DSBR can be analyzed accurately. It then filters out repeats using TRF and RepeatMasker.
gdbr correct -r <reference.fa> -q prepro/query/*.GDBr.preprocess.fa -v prepro/vcf/*.GDBr.preprocess.vcf -s <species of data> -o sv -t <number of threads>
GDBr analyzes the variants and identifies the DSBR mechanism.
gdbr analysis -r <reference.fa> -q prepro/query/*.GDBr.preprocess.fa -v sv/*.GDBr.correct.csv -o dsbr -t <number of threads>
You can turn on different-locus DSBR analysis with `--diff_locus_dsbr_analysis`; however, this analysis can give false positives due to partial homology on the sex chromosomes.
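The three manual steps chain through their output directories: `prepro` feeds `correct`, and `sv` feeds `analysis`. The sketch below wraps the commands shown above in one shell function that stops at the first failing step. The function name `run_gdbr_steps` and the `GDBR` override variable are hypothetical conveniences, not part of GDBr. The flags and paths are copied from the commands above.

```shell
# Hypothetical wrapper: run preprocess, correct and analysis in order,
# stopping as soon as one step fails.
# Usage: run_gdbr_steps <reference.fa> <species of data> <number of threads> <query1.fa> [query2.fa ...]
# Setting GDBR=echo prints the commands instead of running them (dry run).
run_gdbr_steps() (
  ref=$1
  species=$2
  threads=$3
  shift 3   # the remaining arguments are the query FASTA files

  "${GDBR:-gdbr}" preprocess -r "$ref" -q "$@" -o prepro -t "$threads" &&
  "${GDBR:-gdbr}" correct -r "$ref" -q prepro/query/*.GDBr.preprocess.fa \
      -v prepro/vcf/*.GDBr.preprocess.vcf -s "$species" -o sv -t "$threads" &&
  "${GDBR:-gdbr}" analysis -r "$ref" -q prepro/query/*.GDBr.preprocess.fa \
      -v sv/*.GDBr.correct.csv -o dsbr -t "$threads"
)
```

Because the function body runs in a subshell, its variables do not leak into the calling shell. To redo only one step, run the corresponding command on its own instead.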
GDBr's final output is `<query basename>.GDBr.result.tsv`. The table below briefly describes its fields.
Field | Description |
---|---|
ID | GDBr.<query order>.<variant order> |
CALL_TYPE | Variant type : INS, DEL, etc |
SV_TYPE | Corrected variant type : INS, DEL, SUB, etc |
CHR | variant chromosome |
REF_START | variant reference start location |
REF_END | variant reference end location |
QRY_START | variant query start location |
QRY_END | variant query end location |
GDBR_TYPE | GDBr variant type |
HOM_LEN/HOM_START_LEN | INDEL : homology length / SUB : left homology length |
HOM_END_LEN | SUB : right homology length |
TEMP_INS_SEQ_LOC | templated insertion sequence location (REF or QRY) |
DSBR_CHR | different locus DSBR chromosome |
DSBR_START | different locus DSBR start |
DSBR_END | different locus DSBR end |
HOM_SEQ/HOM_START_SEQ | INDEL : homology sequence / SUB : left homology sequence |
HOM_END_SEQ | SUB : right homology sequence |
PUTATIVE_MECHANISM | GDBr DSB repair putative mechanism |
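
To get a quick overview of a result file, you can count how often each value occurs in one column, for example `PUTATIVE_MECHANISM`. The sketch below assumes that the TSV starts with a header row whose column names match the field names in the table above. That is an assumption, so check it against your own output. The function name `count_by_column` is hypothetical.

```shell
# Hypothetical helper: count occurrences of each value in a named column.
# $1 = path to a tab-separated file with a header row, $2 = column name.
count_by_column() {
  awk -F '\t' -v col="$2" '
    NR == 1 {
      # locate the requested column in the header row
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; failed = 1; exit 1 }
      next
    }
    { count[$idx]++ }
    END { if (!failed) for (v in count) print v "\t" count[v] }
  ' "$1" | LC_ALL=C sort
}
```

Running `count_by_column <query basename>.GDBr.result.tsv PUTATIVE_MECHANISM` prints one line per distinct value, followed by a tab and its count. The same function works for other columns such as `SV_TYPE` or `CHR`.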
You can benchmark any GDBr command with the `--benchmark` option, which uses GNU time and psutil. It reports user time, system time, average CPU usage, multiprocessing efficiency, maximum RAM usage and wall clock time.
...
[2023-08-18 13:44:16] GDBr benchmark complete
User time (seconds) : 8007.44
System time (seconds) : 19901.00
Percent of CPU this job got : 7267%
Multiprocessing efficiency : 0.5118
Wall clock time (h:mm:ss or m:ss) : 6:23.99
Max memory usage (GB) : 23.5805
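To relate these numbers to each other: GNU time derives the CPU percentage as (user time + system time) / elapsed time. The sketch below recomputes that ratio from the sample values, expressed as the average number of busy cores. The function name `cpu_cores_used` is hypothetical. How GDBr computes "Multiprocessing efficiency" is not described here, so the sketch does not try to reproduce it.

```shell
# Hypothetical helper: average number of busy cores from benchmark output.
# $1 = user time (s), $2 = system time (s), $3 = wall clock time (h:mm:ss or m:ss)
cpu_cores_used() {
  awk -v u="$1" -v s="$2" -v w="$3" 'BEGIN {
    n = split(w, part, ":")
    secs = 0
    for (i = 1; i <= n; i++) secs = secs * 60 + part[i]   # h:mm:ss or m:ss to seconds
    printf "%.2f\n", (u + s) / secs
  }'
}

cpu_cores_used 8007.44 19901.00 6:23.99   # prints 72.68
```

The result of about 72.7 cores is in line with the 7267% reported above. The small difference comes from the rounding of the displayed times.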