Direct genome-wide identification of G-quadruplex structures by whole-genome resequencing

G4-miner

G4-miner is a genome-wide approach to directly identify G4 structures from standard sequencing data. G4-miner inspects sequencing quality across the whole genome, detects unexpected local fluctuations, and attributes some of these quality fluctuations to the unstable formation of G4 structures.

Installation

Using Git:

git clone https://github.com/tulabcode/G4-miner

Prerequisites

G4-miner is a command-line tool implemented in Perl and shell.

Required:

Linux system
g4predict; install it by following its installation instructions before running G4-miner.
perl, recommended version 5.18.2 or higher
It is recommended to install three Perl modules in advance: Getopt::Long, File::Basename, File::Spec.
python2, recommended version 2.7.16
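
A quick pre-flight check can confirm that the requirements listed above are reachable before running G4-miner. The following is a minimal sketch and not part of G4-miner: the helper names `have_tool`, `have_perl_module` and `check_prerequisites` are hypothetical, and the sketch only assumes that the executables are expected on `PATH`.

```shell
# Hypothetical helpers (not part of G4-miner) to check the prerequisites.

# Return 0 if the named executable is found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Return 0 if perl can load the named module.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Print one status line per required tool and Perl module.
check_prerequisites() {
    for tool in perl python2 g4predict; do
        if have_tool "$tool"; then
            echo "ok      $tool"
        else
            echo "missing $tool"
        fi
    done
    for module in Getopt::Long File::Basename File::Spec; do
        if have_perl_module "$module"; then
            echo "ok      $module"
        else
            echo "missing $module"
        fi
    done
}
```

Calling `check_prerequisites` prints six lines, one per tool or module. If python2 or g4predict is installed outside `PATH`, its absolute path can instead be passed to G4-miner with `-py` or `-g4`.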

Running G4-miner for prediction

Follow the instructions:

# enter the G4-miner directory
cd path_to_G4-miner_directory

# add rights to execute
chmod +x G4-miner.pl

# run G4-miner
./G4-miner.pl
# or
perl /path_to_G4-miner_directory/G4-miner.pl
Usage:
        perl G4-miner.pl [options]
Options:
                -i <file> <input bam/sam file>
                -d <path> <the root directory for saving output>
                -g <file> <the fasta file of the whole genome reference>
                -rm <int> <remove the directory $hdir/$samp, default: 1 (yes)>
                -ch <string> <specify some specific chromosomes for analysing, 
                             -ch chr1 -ch chr2 ... -ch chrX> <Optional, Default genome wide>
                -py <file> <the absolute path of python2> (default: `which python2`)
                -g4 <file> <the absolute path of g4predict> (default: `which g4predict`)
                -rl <int> <the longest length of reads, default = 150 nt>
                -dl <int> <the length of detection region, default = 75 nt>
                -sp <float> <species used to select the G-ratio cutoff; six species are available:
                             human/mouse/fruitfly/thaliana/elegans/yeast,
                             default: human with 0.28; for a species not covered in the
                             paper, the cutoff can be calculated with one more step>
                -n <string> <the name of the sample>

The default model is set to a model trained on the human dataset, as described in our paper.

NOTE: The following table shows the corresponding G-ratio of different species:

Species	G-ratio
Homo sapiens	0.28
Mus musculus	0.28
Drosophila melanogaster	0.27
Arabidopsis thaliana	0.24
Caenorhabditis elegans	0.25
Saccharomyces cerevisiae	0.29
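
The table can be expressed as a small lookup keyed by the species keywords that `-sp` accepts. This is only an illustrative sketch: the function name `g_ratio_for` is hypothetical and not part of G4-miner, and the pairing of each keyword with a table row is inferred from the species names.

```shell
# Hypothetical lookup helper (not part of G4-miner): print the G-ratio
# from the table for one of the six species keywords listed under -sp.
g_ratio_for() {
    case "$1" in
        human)    echo 0.28 ;;  # Homo sapiens
        mouse)    echo 0.28 ;;  # Mus musculus
        fruitfly) echo 0.27 ;;  # Drosophila melanogaster
        thaliana) echo 0.24 ;;  # Arabidopsis thaliana
        elegans)  echo 0.25 ;;  # Caenorhabditis elegans
        yeast)    echo 0.29 ;;  # Saccharomyces cerevisiae
        *)        echo "unknown species: $1" >&2; return 1 ;;
    esac
}
```

For example, `g_ratio_for yeast` prints `0.29`, and an unlisted keyword returns a non-zero status.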

Reference genomes and data sources

Species	Reference genome source	Data source (SRR code)
Homo sapiens	https://hgdownload.soe.ucsc.edu/goldenPath/hg19/	SRR9644818
Mus musculus	https://hgdownload.soe.ucsc.edu/goldenPath/mm10/	SRR13179566
Drosophila melanogaster	https://hgdownload.soe.ucsc.edu/goldenPath/dm6/	SRR12822760
Arabidopsis thaliana	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/	SRR11608990
Caenorhabditis elegans	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/985/GCF_000002985.6_WBcel235/	SRR8816429
Saccharomyces cerevisiae	https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/	SRR13747318

sacCer Test data

Download

Here we provide a Saccharomyces cerevisiae dataset for testing G4-miner. It contains the reference genome file (sacCer3.fa) with its index (sacCer3.fa.fai), and a BAM file aligned to that genome, sorted (SRR13747318.sort.bam) and indexed (SRR13747318.sort.bam.bai). The files can be downloaded from Southeast University Document Cloud. Download the test dataset, create a test directory, upload the downloaded files using rz or FileZilla, and then run the test.

Run test

Follow these instructions to run the test:

# Download test datasets

# create a test directory
cd ~ && mkdir -p work/sacCer && cd ~/work/sacCer

# upload the downloaded files by using `rz` or `FileZilla`

# run test
perl /path_to_G4-miner_directory/G4-miner.pl -i ~/work/sacCer/SRR13747318.sort.bam -d ~/work/sacCer/G4-miner -n sacCer -g ~/work/sacCer/sacCer3.fa -py /usr/bin/python2 -g4 /usr/local/bin/g4predict -sp yeast
## here "~" represents the user's home directory on Linux
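
Before starting the run, it can help to confirm that all four test files arrived intact in the test directory. The helper below is a hedged sketch, not part of G4-miner; the function name `check_test_inputs` is hypothetical, and the file names are the ones given in the Download section.

```shell
# Hypothetical helper (not part of G4-miner): verify that the four test
# inputs exist and are non-empty in the given directory.
check_test_inputs() {
    dir=$1
    status=0
    for f in sacCer3.fa sacCer3.fa.fai \
             SRR13747318.sort.bam SRR13747318.sort.bam.bai; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

A call such as `check_test_inputs ~/work/sacCer` returns 0 only when every file is present, so it can gate the G4-miner command with `&&`.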

Result analysis

This project provides a case analysis for Saccharomyces cerevisiae (yeast); the output is written to the folder "~/work/sacCer/G4-miner".

In the output directory, each chromosome has its own result folder. For each chromosome, the tool counts the positive, negative and false-positive regions, counts the regions that satisfy each combination of the M and N parameters, and finds the parameter values corresponding to the most representative low-quality regions. The tool also distinguishes between the forward and reverse strands: the suffix ‘f’ corresponds to the forward strand, and the suffix ‘r’ corresponds to the reverse strand.

In each chromosome directory, 11 result files are generated for each of the forward and reverse strands, with the following content:

<Median.gz>: The median of the quality value corresponding to each base on the chromosome

<PQ_g4predict>: The PQ file on the whole genome predicted by G4predict, named "PQ_g4predict_minus" on the forward strand, and "PQ_g4predict_plus" on the reverse strand

<positive_array>: The number of positive regions corresponding to different M and N parameter conditions

<negative_sloci.gz>: Position information of the negative region corresponding to different M and N parameter conditions

<false_positive_sloci.gz>: Position information of the false-positive region corresponding to different M and N parameter conditions

<negative_array>: The number of negative regions corresponding to different M and N parameter conditions.

<false_positive_array>: The number of false-positive regions corresponding to different M and N parameter conditions.

<OQ.gz>: The collection of low-quality region sequences, that is, the observed quadruplexes (OQ); the sequences may overlap

<G4_OQ_unmerged>: A collection of G-quadruplex-related sequences, with overlaps, screened by sequence characteristics

<G4_OQ_merged>: A collection of G-quadruplex-related sequences, without overlaps, screened by sequence characteristics

<G4_PQinOQ>: The set of PQs contained in OQs
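
Given the description above, a finished chromosome directory would be expected to hold 22 result files (11 per strand). The sketch below counts the files in each chromosome folder as a rough completeness check. It is not part of G4-miner: the function name `count_result_files` is hypothetical, and it assumes that the chromosome folders sit directly under the directory passed in and contain only result files. The exact location of those folders under the `-d` root is not specified here, so pass whichever directory holds them.

```shell
# Hypothetical helper (not part of G4-miner): for each subdirectory of the
# given results directory, print its name and the number of regular files
# directly inside it, separated by a tab.
count_result_files() {
    results_dir=$1
    for chr_dir in "$results_dir"/*/; do
        [ -d "$chr_dir" ] || continue
        n=$(find "$chr_dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$(basename "$chr_dir")" "$n"
    done
}
```

Any folder reporting fewer files than expected may indicate a run that did not finish for that chromosome.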

Contact information

Jing Tu, jtu@seu.edu.cn
