-
Notifications
You must be signed in to change notification settings - Fork 0
EVG paper
A comprehensive benchmark of graph-based genetic variant genotyping algorithms on plant genomes for creating an accurate ensemble pipeline
Welcome to the wiki page that provides comprehensive details on the code utilized by the genome graph software featured in the EVG article. In this article, we have tested eight commonly used software which include VG-MAP, VG-Giraffe, GraphAligner, Paragraph, BayesTyper, GraphTyper2, PanGenie and Gramtools.
The data utilized in EVG is described in the article. For convenience, let's assume the following file names for the input:
read.1.fq.gz
read.2.fq.gz
refgenome.fa
input.vcf.gz
Our initial step involved using BWA to map the second-generation data with the reference genome, consequently producing BAM file for downstream analysis with Paragraph, BayesTyper and GraphTyper2.
# get the BAM file
bwa mem -R '@RG\tID:foo\tSM:bar\tLB:library1' -t 10 refgenome.fa read.1.fq.gz read.2.fq.gz | samtools view -b -S | samtools sort --write-index -@ 8 -o out.bam
To begin, the initial step involves building a genome graph and creating index:
# vg autoindex
vg autoindex -T ./temp/ -M 50G -t 10 -R XG --workflow map -r refgenome.fa -v input.vcf.gz -p out
# vg snarls
vg snarls -t 10 out.xg 1>out.snarls
Next, the sequencing reads are aligned to the graph:
# vg map
vg map -t 10 -g out.gcsa -x out.xg -f read.1.fq.gz -f read.2.fq.gz 1>out.gam
Finally, genotyping the variants in the graph using the out.gam
file:
# vg pack
vg pack -t 10 -x out.xg -g out.gam -o out.pack
# vg call
vg call -t 10 -s out out.xg -k out.pack -r out.snarls 1>out.vcf
Similar to VG MAP
, the initial step involves building a genome graph and creating index:
# vg autoindex
vg autoindex -T ./temp/ -M 50G -t 10 -R XG --workflow giraffe -r refgenome.fa -v input.vcf.gz -p out
# vg snarls
vg snarls -t 10 out.xg 1>out.snarls
Next, the sequencing reads are aligned to the graph:
# vg giraffe
vg giraffe -t 10 -x out.xg -Z out.giraffe.gbz -m out.min -d out.dist -f read.1.fq.gz -f read.2.fq.gz 1>out.gam
Finally, genotyping the variants in the graph using the out.gam
file:
# vg pack
vg pack -t 10 -x out.xg -g out.gam -o out.pack
# vg call
vg call -t 10 -s out out.xg -k out.pack -r out.snarls 1>out.vcf
Similar to VG MAP
, the initial step involves building a genome graph and creating index:
# vg construct
vg construct -t 10 -r refgenome.fa -v input.vcf.gz 1>out.vg
# vg index
vg index -t 10 -x out.xg out.vg
# vg snarls
vg snarls -t 10 out.xg 1>out.snarls
Next, the sequencing reads are aligned to the graph:
# GraphAligner
GraphAligner -t 10 -g out.vg -x vg -f read.1.fq.gz -f read.2.fq.gz -a out.gam
Finally, genotyping the variants in the graph using the out.gam
file:
# vg pack
vg pack -t 10 -x out.xg -g out.gam -o out.pack
# vg call
vg call -t 10 -s out out.xg -k out.pack -r out.snarls 1>out.vcf
The required input files for Paragraph include a reference genome, a VCF file, and a configuration file that specifies the path path to BAM file. The configuration file format is specified below:
# sample.txt
id path depth read length
out out.bam 30 150
To genotype the variants, execute the command provided below:
# genotyping
multigrmpy.py -i input.vcf.gz -m sample.txt -r refgenome.fa --threads 10 -M 600 -o out
To run GraphTyper2, you will need to provide the following input files: a reference genome, a VCF file, a BAM file, coverage files for the BAM, and region files specifying the regions to be genotyped.
The BAM coverage file can be generated using the following command:
# samtools
samtools idxstats out.bam | head -n -1 | awk '{sum+=$3+$4; ref+=$2} END{print sum/ref}' 1>out.cov
The region file should follow this format:
# region.txt
chr1:1-30200790
chr2:1-45040766
...
To genotype the variants, execute the command provided below:
# genotyping
graphtyper genotype_sv refgenome.fa input.vcf.gz --sam=out.bam --region_file=region.txt --output=out --avg_cov_by_readlen=out.cov
To run BayesTyper, you will need to provide the following input files: a reference genome, a VCF file, a BAM file, and sample file hat specifies the prefix used by bayesTyperTools' makeBloom tool.
The sample file should follow this format:
# sample.tsv
out F out.bam
Before running BayesTyper, it is necessary to use KMC to count the k-mers in the BAM file.
# KMC
kmc -k55 -ci1 -t1 -fbam out.bam out.bam out.bam
Based on the generated k-mers, use bayesTyperTools makeBloom to construct a read Bloom filter.
# bayesTyperTools makeBloom
bayesTyperTools makeBloom -k out.bam -p 10
Then, use bayesTyper cluster for variants clustering.
# bayesTyper cluster
bayesTyper cluster -v input.vcf.gz -s sample.tsv -g refgenome.fa -p 10
Finally, use bayesTyper cluster for genotyping.
# bayesTyper cluster
for i in $(ls | grep 'bayestyper_unit_'); do
bayesTyper genotype -v $i/variant_clusters.bin -c bayestyper_cluster_data -s sample.tsv -g refgenome.fa -o $i/bayestyper -z -p 10 --noise-genotyping
done
Running PanGenie requires only one simple command. However, as PanGenie cannot recognize compressed files, the first step is to decompress the input file before genotyping.
# decompress
gunzip input.vcf.gz
gunzip read.1.fq.gz
gunzip read.2.fq.gz
cat read.1.fq read.2.fq > out.fq
# PanGenie
PanGenie -s out -i out.fq -r refgenome.fa -v out.vcf -t 10 -j 10 -o out
The first step in running Gramtools is to build a genome graph.
# gramtools build
gramtools build --max_threads 10 -o ./build --ref refgenome.fa --vcf input.vcf.gz --force --debug
After which, you can perform genotyping using the following command:
# gramtools genotype
gramtools genotype --max_threads 10 --debug --force --sample_id out -i build --ploidy diploid -o out --reads read.1.fq.gz read.2.fq.gz