Chemical118/GDBr

GDBr : Genome identification tool for Double-strand Break Repair

GDBr (pronounced "Genome Debugger") is a tool that identifies Double-strand Break Repair (DSBR) from genomes and their variants. GDBr works in three steps. First, it preprocesses the query genomes with RagTag and svim-asm and makes sure their chromosome names match the reference. Second, it corrects the variants with BLAST, filters out variants that contain repeats using TRF and RepeatMasker, and saves the result as a CSV file. Finally, it segregates the corrected variants into the appropriate DSBRs.

To use GDBr, you need only a reference sequence file and one or more query sequence files.

Install

We strongly recommend installing GDBr with the conda package manager.

conda create -n GDBr -c conda-forge -c bioconda -c chemical118 gdbr
conda activate GDBr
gdbr --version

You can also use the mamba package manager to install GDBr more quickly.

mamba create -n GDBr -c conda-forge -c bioconda -c chemical118 gdbr
mamba activate GDBr
gdbr --version

Quick Start

gdbr analysis -r <reference.fa> -q <query1.fa query2.fa ...> -s <species of data> -t <number of threads>

Steps of GDBr

The command above runs all three of the following steps in a single invocation. If you want to redo an individual step, you can run the corresponding command below manually.

Preprocess

Using RagTag and svim-asm, GDBr preprocesses the data and returns a properly scaffolded query sequence file (.fa) and a variant file (.vcf).

gdbr preprocess -r <reference.fa> -q <query1.fa query2.fa ...> -o prepro -t <number of threads>

Scaffolding and variant calling in the preprocess step rely on a sorting program, which does not use all of its threads even when many are allocated. One optimization is to process multiple queries at once, giving each a small number of threads. However, this approach requires very high memory usage, so GDBr lets the user choose whether to apply it through the --low_memory option.

Correct

Using BLAST, GDBr corrects the variant file so that DSBR can be analyzed accurately. It also filters out repeats using TRF and RepeatMasker.

gdbr correct -r <reference.fa> -q prepro/query/*.GDBr.preprocess.fa -v prepro/vcf/*.GDBr.preprocess.vcf -s <species of data> -o sv -t <number of threads>

Analysis

GDBr analyzes the variants and identifies the DSBR mechanism.

gdbr analysis -r <reference.fa> -q prepro/query/*.GDBr.preprocess.fa -v sv/*.GDBr.correct.csv -o dsbr -t <number of threads>

You can turn on different-locus DSBR analysis with --diff_locus_dsbr_analysis; however, this analysis can give false positives due to partial homology on the sex chromosomes.
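For scripted reruns, the three step commands above can be assembled in Python. This is a minimal sketch, not part of GDBr: the helper names, the file names, the thread count and the species value are hypothetical, and only flags shown in this README are used. It assumes --diff_locus_dsbr_analysis is a plain on/off switch. The globs are expanded with `glob` because `subprocess` does not expand shell wildcards when it is given an argument list.

```python
import glob
import subprocess


def preprocess_cmd(reference, queries, threads, out="prepro"):
    """Argument list for the preprocess step (flags taken from the README)."""
    return ["gdbr", "preprocess", "-r", reference, "-q", *queries,
            "-o", out, "-t", str(threads)]


def correct_cmd(reference, prepro_fa, prepro_vcf, species, threads, out="sv"):
    """Argument list for the correct step."""
    return ["gdbr", "correct", "-r", reference, "-q", *prepro_fa,
            "-v", *prepro_vcf, "-s", species, "-o", out, "-t", str(threads)]


def analysis_cmd(reference, prepro_fa, corrected_csv, threads, out="dsbr",
                 diff_locus=False):
    """Argument list for the analysis step."""
    cmd = ["gdbr", "analysis", "-r", reference, "-q", *prepro_fa,
           "-v", *corrected_csv, "-o", out, "-t", str(threads)]
    if diff_locus:
        # Assumption: the option is a switch that takes no value.
        cmd.append("--diff_locus_dsbr_analysis")
    return cmd


def run_pipeline(reference, queries, species, threads):
    """Run the three steps in order, globbing each step's outputs."""
    subprocess.run(preprocess_cmd(reference, queries, threads), check=True)
    # sorted() mirrors the alphabetical order a shell glob would produce.
    fa = sorted(glob.glob("prepro/query/*.GDBr.preprocess.fa"))
    vcf = sorted(glob.glob("prepro/vcf/*.GDBr.preprocess.vcf"))
    subprocess.run(correct_cmd(reference, fa, vcf, species, threads), check=True)
    csv_files = sorted(glob.glob("sv/*.GDBr.correct.csv"))
    subprocess.run(analysis_cmd(reference, fa, csv_files, threads), check=True)


# Dry run: print one command instead of executing it.
# "ref.fa", "q1.fa", "q2.fa" and 8 threads are placeholder values.
print(" ".join(preprocess_cmd("ref.fa", ["q1.fa", "q2.fa"], 8)))
```

Calling `run_pipeline(...)` executes the steps for real and requires `gdbr` on the PATH; `check=True` stops the script as soon as one step fails.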

Final output

GDBr's final output is <query basename>.GDBr.result.tsv. Its fields are described below.

| Field | Description |
| --- | --- |
| ID | GDBr.<query order>.<variant order> |
| CALL_TYPE | Variant type : INS, DEL, etc |
| SV_TYPE | Corrected variant type : INS, DEL, SUB, etc |
| CHR | Variant chromosome |
| REF_START | Variant reference start location |
| REF_END | Variant reference end location |
| QRY_START | Variant query start location |
| QRY_END | Variant query end location |
| GDBR_TYPE | GDBr variant type |
| HOM_LEN/HOM_START_LEN | INDEL : homology length / SUB : left homology length |
| HOM_END_LEN | SUB : right homology length |
| TEMP_INS_SEQ_LOC | Templated insertion sequence location (REF or QRY) |
| DSBR_CHR | Different locus DSBR chromosome |
| DSBR_START | Different locus DSBR start |
| DSBR_END | Different locus DSBR end |
| HOM_SEQ/HOM_START_SEQ | INDEL : homology sequence / SUB : left homology sequence |
| HOM_END_SEQ | SUB : right homology sequence |
| PUTATIVE_MECHANISM | GDBr DSB repair putative mechanism |
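As an illustration of working with the result file, the sketch below tallies variants per PUTATIVE_MECHANISM using only the Python standard library. It assumes the TSV starts with a header row carrying the field names listed above. The helper name `count_mechanisms` is hypothetical, and the sample rows, including the mechanism labels, are made-up placeholders rather than real GDBr output.

```python
import csv
import io
from collections import Counter


def count_mechanisms(handle, field="PUTATIVE_MECHANISM"):
    """Count rows per value of `field` in a tab-separated GDBr result."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[field] for row in reader)


# Made-up rows for illustration only; MECH_A and MECH_B are placeholders,
# not labels that GDBr is known to emit.
sample = (
    "ID\tSV_TYPE\tCHR\tPUTATIVE_MECHANISM\n"
    "GDBr.0.0\tDEL\tchr1\tMECH_A\n"
    "GDBr.0.1\tINS\tchr1\tMECH_B\n"
    "GDBr.0.2\tDEL\tchr2\tMECH_A\n"
)

counts = count_mechanisms(io.StringIO(sample))
for mechanism, n in counts.most_common():
    print(mechanism, n)
```

For a real file, open it with `open(path, newline="")` and pass the handle to `count_mechanisms`; passing `field="SV_TYPE"` tallies corrected variant types instead.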

Benchmarking

You can benchmark any GDBr command with the --benchmark option, which uses GNU time and psutil. It reports user time, system time, average CPU usage, multiprocessing efficiency, maximum RAM usage and wall clock time.

...
[2023-08-18 13:44:16] GDBr benchmark complete
User time (seconds) : 8007.44
System time (seconds) : 19901.00
Percent of CPU this job got : 7267%
Multiprocessing efficiency : 0.5118
Wall clock time (h:mm:ss or m:ss) : 6:23.99
Max memory usage (GB) : 23.5805
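To relate these numbers to each other, the sketch below parses the block above and recomputes the CPU percentage the way GNU time defines it: (user time + system time) / wall clock time. The helper names are hypothetical and the parser assumes the "name : value" layout shown above. How GDBr derives "Multiprocessing efficiency" is not stated here, so the sketch leaves that value alone.

```python
import re

LOG = """\
[2023-08-18 13:44:16] GDBr benchmark complete
User time (seconds) : 8007.44
System time (seconds) : 19901.00
Percent of CPU this job got : 7267%
Multiprocessing efficiency : 0.5118
Wall clock time (h:mm:ss or m:ss) : 6:23.99
Max memory usage (GB) : 23.5805
"""


def parse_wall_clock(text):
    """Convert an 'h:mm:ss' or 'm:ss' elapsed time to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_benchmark(log):
    """Collect 'name : value' pairs from the benchmark block into a dict."""
    fields = {}
    for line in log.splitlines():
        match = re.match(r"^(.*?) : (\S+)$", line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


fields = parse_benchmark(LOG)
user = float(fields["User time (seconds)"])
system = float(fields["System time (seconds)"])
wall = parse_wall_clock(fields["Wall clock time (h:mm:ss or m:ss)"])

# CPU seconds consumed per second of wall clock time, as a percentage.
cpu_percent = 100 * (user + system) / wall
print(f"wall clock: {wall:.2f} s, recomputed CPU usage: {cpu_percent:.0f}%")
```

The recomputed value lands within about one percentage point of the reported 7267%; the small gap likely comes from rounding in the printed times.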