GDBr : Genome identification tool for Double-strand Break Repair

GDBr (pronounced Genome Debugger) is a tool that identifies Double-strand Break Repair (DSBR) from genome assemblies and their variants. GDBr identifies DSBR in three steps. The first step preprocesses the query genomes with RagTag and svim-asm and makes sure their chromosome names match those of the reference. The second step corrects the variants with BLAST, filters out variants that contain repeats using TRF and RepeatMasker, and saves the result as a CSV file. The last step assigns each corrected variant to the appropriate DSBR mechanism.

You only need a reference sequence file and query sequence files to use GDBr.

Install

We strongly recommend installing GDBr with the conda package manager.

conda create -n GDBr -c conda-forge -c bioconda -c chemical118 gdbr
conda activate GDBr
gdbr --version

You can also use the mamba package manager to install GDBr more quickly.

mamba create -n GDBr -c conda-forge -c bioconda -c chemical118 gdbr
mamba activate GDBr
gdbr --version

Quick Start

gdbr analysis -r <reference.fa> -q <query1.fa query2.fa ...> -s <species of data> -t <number of threads>

Steps of GDBr

The above command runs the following three steps in a single run. If you want to redo individual steps, you can run the commands below manually.

Preprocess

Using RagTag and svim-asm, GDBr preprocesses the data and returns a properly scaffolded query .fa sequence file and a variant .vcf file.

gdbr preprocess -r <reference.fa> -q <query1.fa query2.fa ...> -o prepro -t <number of threads>

The preprocess step relies on a sorting program for scaffolding and variant calling, and these programs do not use all of the threads allocated to them, even when given many. One optimization is to process multiple queries at once, each with a small number of threads. However, this approach requires very high memory usage, so GDBr provides the --low_memory option to let the user decide whether to use this optimization.
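The trade-off described above can be illustrated with a small thread-budget calculation. This is only a sketch of the general idea, not GDBr's actual scheduler: the function names, the per-job thread count and the per-job memory figure are all hypothetical.

```python
from typing import Tuple


def plan_parallel_jobs(total_threads: int, n_queries: int,
                       threads_per_job: int) -> Tuple[int, int]:
    """Split a thread budget into concurrent jobs (hypothetical helper).

    Returns (concurrent_jobs, threads_per_job).
    """
    if min(total_threads, n_queries, threads_per_job) < 1:
        raise ValueError("all arguments must be positive")
    per_job = min(threads_per_job, total_threads)
    # Never start more jobs than there are queries or thread slots.
    jobs = min(n_queries, total_threads // per_job)
    return jobs, per_job


def estimate_peak_memory_gb(jobs: int, memory_per_job_gb: float) -> float:
    """Rough upper bound: every concurrent job holds its own data in memory."""
    return jobs * memory_per_job_gb


# Assumed numbers for illustration only.
jobs, per_job = plan_parallel_jobs(total_threads=32, n_queries=10, threads_per_job=4)
print(jobs, per_job, estimate_peak_memory_gb(jobs, memory_per_job_gb=6.0))
```

Running more jobs side by side keeps more threads busy, but peak memory grows roughly in proportion to the number of concurrent jobs, which is why such an optimization is worth making optional.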

Correct

Using BLAST, GDBr corrects the variant file so that DSBR can be analyzed accurately. It also filters out repeats using TRF and RepeatMasker.

gdbr correct -r <reference.fa> -q prepro/query/*.GDBr.preprocess.fa -v prepro/vcf/*.GDBr.preprocess.vcf -s <species of data> -o sv -t <number of threads>

Analysis

GDBr analyzes the variants and identifies the DSBR mechanism.

gdbr analysis -r <reference.fa> -q prepro/query/*.GDBr.preprocess.fa -v sv/*.GDBr.correct.csv -o dsbr -t <number of threads>

You can turn on different-locus DSBR analysis with --diff_locus_dsbr_analysis; however, this analysis can give false positives due to partial homology on the sex chromosomes.
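If you rerun the three steps from a script, the commands shown in the Preprocess, Correct and Analysis sections can be assembled as argument lists. The sketch below only builds and prints the commands. The helper names and the example file names are hypothetical, while the flags and output directories are the ones used in the commands above.

```python
import glob
import shlex
from typing import List


def find(pattern: str) -> List[str]:
    """Expand a glob pattern in sorted order so reruns see the same file order."""
    return sorted(glob.glob(pattern))


def preprocess_cmd(reference: str, queries: List[str], threads: int) -> List[str]:
    return ["gdbr", "preprocess", "-r", reference, "-q", *queries,
            "-o", "prepro", "-t", str(threads)]


def correct_cmd(reference: str, queries: List[str], vcfs: List[str],
                species: str, threads: int) -> List[str]:
    return ["gdbr", "correct", "-r", reference, "-q", *queries, "-v", *vcfs,
            "-s", species, "-o", "sv", "-t", str(threads)]


def analysis_cmd(reference: str, queries: List[str], csvs: List[str],
                 threads: int) -> List[str]:
    return ["gdbr", "analysis", "-r", reference, "-q", *queries, "-v", *csvs,
            "-o", "dsbr", "-t", str(threads)]


# Dry run with hypothetical file names; nothing is executed here.
print(shlex.join(preprocess_cmd("ref.fa", ["q1.fa", "q2.fa"], threads=8)))
```

To actually run a step, pass the list to `subprocess.run(cmd, check=True)`. Expand the inputs of the later steps only after the previous step has finished, for example `find("prepro/query/*.GDBr.preprocess.fa")` once preprocessing is done.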

Final output

GDBr's final output is <query basename>.GDBr.result.tsv. Below is a brief description of its fields.

Field	Description
ID	GDBr.<query order>.<variant order>
CALL_TYPE	Variant type : INS, DEL, etc
SV_TYPE	Corrected variant type : INS, DEL, SUB, etc
CHR	variant chromosome
REF_START	variant reference start location
REF_END	variant reference end location
QRY_START	variant query start location
QRY_END	variant query end location
GDBR_TYPE	GDBr variant type
HOM_LEN/HOM_START_LEN	INDEL : homology length / SUB : left homology length
HOM_END_LEN	SUB : right homology length
TEMP_INS_SEQ_LOC	templated insertion sequence location (REF or QRY)
DSBR_CHR	different locus DSBR chromosome
DSBR_START	different locus DSBR start
DSBR_END	different locus DSBR end
HOM_SEQ/HOM_START_SEQ	INDEL : homology sequence / SUB : left homology sequence
HOM_END_SEQ	SUB : right homology sequence
PUTATIVE_MECHANISM	GDBr DSB repair putative mechanism
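
The result file can be summarized with Python's standard library. The following is a minimal sketch that assumes the file is tab-separated with a header row containing the field names listed above, and that the two order components of ID are integers. The mechanism labels in the sample are placeholders, not real GDBr output.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable, List, TextIO, Tuple


def read_results(handle: TextIO) -> List[Dict[str, str]]:
    """Read a GDBr result TSV into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def count_by(rows: Iterable[Dict[str, str]], field: str) -> Counter:
    """Count how often each value of `field` occurs."""
    return Counter(row[field] for row in rows)


def parse_id(variant_id: str) -> Tuple[int, int]:
    """Split 'GDBr.<query order>.<variant order>' into two integers (assumed)."""
    prefix, query_order, variant_order = variant_id.split(".")
    if prefix != "GDBr":
        raise ValueError(f"unexpected ID: {variant_id}")
    return int(query_order), int(variant_order)


# Placeholder data; real files have all of the columns described above.
sample = (
    "ID\tSV_TYPE\tPUTATIVE_MECHANISM\n"
    "GDBr.0.1\tDEL\tMECH_A\n"
    "GDBr.0.2\tINS\tMECH_B\n"
    "GDBr.1.1\tDEL\tMECH_A\n"
)
rows = read_results(io.StringIO(sample))
print(count_by(rows, "PUTATIVE_MECHANISM"))
```

For a real file, replace the `io.StringIO` sample with `open("<query basename>.GDBr.result.tsv", newline="")`.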

Benchmarking

You can benchmark any GDBr command with the --benchmark option, which uses GNU time and psutil. It reports user time, system time, average CPU usage, multiprocessing efficiency, maximum RAM usage and wall clock time.

...
[2023-08-18 13:44:16] GDBr benchmark complete
User time (seconds) : 8007.44
System time (seconds) : 19901.00
Percent of CPU this job got : 7267%
Multiprocessing efficiency : 0.5118
Wall clock time (h:mm:ss or m:ss) : 6:23.99
Max memory usage (GB) : 23.5805
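
As a sanity check on the report above, GNU time defines the CPU percentage as (user time + system time) divided by elapsed time. The sketch below parses the sample lines and recomputes that ratio. The parsing helpers are hypothetical and assume the "key : value" layout shown above. How GDBr derives the multiprocessing efficiency is not stated here, so the sketch does not try to reproduce it.

```python
from typing import Dict


def parse_benchmark(text: str) -> Dict[str, str]:
    """Collect 'key : value' lines from a benchmark report."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skip lines without the ' : ' separator, such as the log line
            fields[key.strip()] = value.strip()
    return fields


def wall_seconds(clock: str) -> float:
    """Convert 'h:mm:ss' or 'm:ss' to seconds."""
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


report = """[2023-08-18 13:44:16] GDBr benchmark complete
User time (seconds) : 8007.44
System time (seconds) : 19901.00
Percent of CPU this job got : 7267%
Wall clock time (h:mm:ss or m:ss) : 6:23.99
"""
fields = parse_benchmark(report)
cpu_seconds = float(fields["User time (seconds)"]) + float(fields["System time (seconds)"])
elapsed = wall_seconds(fields["Wall clock time (h:mm:ss or m:ss)"])
cpu_ratio = cpu_seconds / elapsed  # average number of busy cores
print(f"{cpu_ratio:.2f} cores busy on average")
```

The recomputed ratio is roughly 72.7 cores, which agrees with the reported 7267% up to rounding.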
