Skip to content

EVG paper

杜泽臻 edited this page May 27, 2023 · 16 revisions

A comprehensive benchmark of graph-based genetic variant genotyping algorithms on plant genomes for creating an accurate ensemble pipeline

Welcome to the wiki page that provides comprehensive details on the code utilized by the genome graph software featured in the EVG article. In this article, we have tested eight commonly used software which include VG-MAP, VG-Giraffe, GraphAligner, Paragraph, BayesTyper, GraphTyper2, PanGenie and Gramtools.

Data

The data utilized in EVG is described in the article. For convenience, let's assume the following file names for the input:

  • read.1.fq.gz
  • read.2.fq.gz
  • refgenome.fa
  • input.vcf.gz

BWA MEM

Our initial step involved using BWA to map the second-generation data with the reference genome, consequently producing BAM file for downstream analysis with Paragraph, BayesTyper and GraphTyper2.

# get the BAM file
bwa mem -R '@RG\tID:foo\tSM:bar\tLB:library1' -t 10 refgenome.fa read.1.fq.gz read.2.fq.gz | samtools view -b -S | samtools sort --write-index -@ 8 -o out.bam

VG MAP

To begin, the initial step involves building a genome graph and creating index:

# vg autoindex
vg autoindex -T ./temp/ -M 50G -t 10 -R XG --workflow map -r refgenome.fa -v input.vcf.gz -p out
# vg snarls
vg snarls -t 10 out.xg 1>out.snarls

Next, the sequencing reads are aligned to the graph:

# vg map
vg map -t 10 -g out.gcsa -x out.xg -f read.1.fq.gz -f read.2.fq.gz 1>out.gam

Finally, genotyping the variants in the graph using the out.gam file:

# vg pack
vg pack -t 10 -x out.xg -g out.gam -o out.pack
# vg call
vg call -t 10 -s out out.xg -k out.pack -r out.snarls 1>out.vcf

VG Giraffe

Similar to VG MAP, the initial step involves building a genome graph and creating index:

# vg autoindex
vg autoindex -T ./temp/ -M 50G -t 10 -R XG --workflow giraffe -r refgenome.fa -v input.vcf.gz -p out
# vg snarls
vg snarls -t 10 out.xg 1>out.snarls

Next, the sequencing reads are aligned to the graph:

# vg giraffe
vg giraffe -t 10 -x out.xg -Z out.giraffe.gbz -m out.min -d out.dist -f read.1.fq.gz -f read.2.fq.gz 1>out.gam

Finally, genotyping the variants in the graph using the out.gam file:

# vg pack
vg pack -t 10 -x out.xg -g out.gam -o out.pack
# vg call
vg call -t 10 -s out out.xg -k out.pack -r out.snarls 1>out.vcf

GraphAligner

Similar to VG MAP, the initial step involves building a genome graph and creating index:

# vg construct
vg construct -t 10 -r refgenome.fa -v input.vcf.gz 1>out.vg
# vg index
vg index -t 10 -x out.xg out.vg
# vg snarls
vg snarls -t 10 out.xg 1>out.snarls

Next, the sequencing reads are aligned to the graph:

# GraphAligner
GraphAligner -t 10 -g out.vg -x vg -f read.1.fq.gz -f read.2.fq.gz -a out.gam

Finally, genotyping the variants in the graph using the out.gam file:

# vg pack
vg pack -t 10 -x out.xg -g out.gam -o out.pack
# vg call
vg call -t 10 -s out out.xg -k out.pack -r out.snarls 1>out.vcf

Paragraph

The required input files for Paragraph include a reference genome, a VCF file, and a configuration file that specifies the path path to BAM file. The configuration file format is specified below:

# sample.txt
id      path      depth     read length
out     out.bam   30        150

To genotype the variants, execute the command provided below:

# genotyping
multigrmpy.py -i input.vcf.gz -m sample.txt -r refgenome.fa --threads 10 -M 600 -o out

GraphTyper2

To run GraphTyper2, you will need to provide the following input files: a reference genome, a VCF file, a BAM file, coverage files for the BAM, and region files specifying the regions to be genotyped.

The BAM coverage file can be generated using the following command:

# samtools
samtools idxstats out.bam | head -n -1 | awk '{sum+=$3+$4; ref+=$2} END{print sum/ref}' 1>out.cov

The region file should follow this format:

# region.txt
chr1:1-30200790
chr2:1-45040766
...

To genotype the variants, execute the command provided below:

# genotyping
graphtyper genotype_sv refgenome.fa input.vcf.gz --sam=out.bam --region_file=region.txt --output=out --avg_cov_by_readlen=out.cov

BayesTyper

To run BayesTyper, you will need to provide the following input files: a reference genome, a VCF file, a BAM file, and sample file hat specifies the prefix used by bayesTyperTools' makeBloom tool.

The sample file should follow this format:

# sample.tsv
out F out.bam

Before running BayesTyper, it is necessary to use KMC to count the k-mers in the BAM file.

# KMC
kmc -k55 -ci1 -t1 -fbam out.bam out.bam out.bam

Based on the generated k-mers, use bayesTyperTools makeBloom to construct a read Bloom filter.

# bayesTyperTools makeBloom
bayesTyperTools makeBloom -k out.bam -p 10

Then, use bayesTyper cluster for variants clustering.

# bayesTyper cluster
bayesTyper cluster -v input.vcf.gz -s sample.tsv -g refgenome.fa -p 10

Finally, use bayesTyper cluster for genotyping.

# bayesTyper cluster
for i in $(ls | grep 'bayestyper_unit_'); do
	bayesTyper genotype -v $i/variant_clusters.bin -c bayestyper_cluster_data -s sample.tsv -g refgenome.fa -o $i/bayestyper -z -p 10 --noise-genotyping
done

PanGenie

Running PanGenie requires only one simple command. However, as PanGenie cannot recognize compressed files, the first step is to decompress the input file before genotyping.

# decompress
gunzip input.vcf.gz
gunzip read.1.fq.gz
gunzip read.2.fq.gz
cat read.1.fq read.2.fq > out.fq
# PanGenie
PanGenie -s out -i out.fq -r refgenome.fa -v out.vcf -t 10 -j 10 -o out

Gramtools

The first step in running Gramtools is to build a genome graph.

# gramtools build
gramtools build --max_threads 10 -o ./build --ref refgenome.fa --vcf input.vcf.gz --force --debug

After which, you can perform genotyping using the following command:

# gramtools genotype
gramtools genotype --max_threads 10 --debug --force --sample_id out -i build --ploidy diploid -o out --reads read.1.fq.gz read.2.fq.gz