Skip to content

User guide

Hannes Pétur Eggertsson edited this page Aug 28, 2023 · 3 revisions

graphtyper user guide

Genotyping

Small variants (SNP/indels)

The command to run small variant genotyping is:

graphtyper genotype <REFERENCE.fa> --sams=<BAMLIST_OR_CRAMLIST> --region=<chrA:begin-end>

where REFERENCE.fa is the FASTA reference genome, BAMLIST_OR_CRAMLIST are the input BAM/CRAM files (one per line), and T is the maximum amount of threads you wish to allocate. Note that T should be equal or lower than your number of input SAM files as there is always just one thread reading each SAM file. For more details and other options see the help page

./graphtyper genotype --help
./graphtyper genotype --advanced --help # To see advanced options

Structural variants (SVs)

For SV genotyping you should instead use genotype_sv subcommand.

./graphtyper genotype_sv <REFERENCE.fa> <input.vcf.gz> --sams=<BAMLIST_OR_CRAMLIST> --region=<chrA:begin-end> --threads=<T>

For more details and other options see the help page

./graphtyper genotype_sv --help
./graphtyper genotype --advanced --help # To see advanced options

In our experience, discovering structural variants (SVs) using Manta diploidSV.vcf.gz yields the best results as Manta provides very detailed information on the exact breakpoint sequence of SVs - some of which are essential for graphtyper. Use svimmer to merge candidate from multiple diploadSV files together.

Tweaking graphtyper

Subsampling reads in abnormally high sequence depth

Some regions of the genome abnormally high sequence depth, even ten times the average coverage or more. You may improve the overall time of graphtyper by subsampling reads in such regions using --avg_cov_by_readlen=FILE

The FILE in --avg_cov_by_readlen=FILE which should contain the average coverage divided by read length for each BAM/CRAM (one value per line). The list provided in --sams=FILE should be in the same order as the avg_cov_by_readlen file. So for example if have a sample with 30x coverage with 100bp reads you should put the value "0.3". You may calculate the values for an input BAMLIST using:

parallel -k "samtools idxstats {1} | head -n -1 | awk '{sum+=\$3+\$4; ref+=\$2} END{print sum/ref}'" :::: ${BAMLIST} > ${BAMLIST}.avg_cov_by_readlen

Reduce graphtyper memory consumption

If graphtyper is using too much memory for your system you might try setting a lower --max_files_open=VAL value (the trade off is a bit slower running speed). Set it to, for example, two times your CPU thread count. This advanced option will determine how many files each tread can open at the same time and a lower value forces graphtyper to processess fewer individuals at a time.

Post process tasks

Concatenate output VCF files

The graphtyper output files will be in small regions. If you only have one or a few sample you might want to concatenate them and for that you can use the bcftools concat --naive command, for example

$ echo chr{1..22} chrX | tr ' ' '\n' | while read chrom; do if [[ ! -d results/${chrom} ]]; then continue; fi; find results/${chrom} -name "*.vcf.gz" | sort; done > vcf_file_list
$ bcftools concat --naive --file-list vcf_file_list -Oz -o large.vcf.gz
$ tabix -p vcf large.vcf.gz

Note: replace results with sv_results if you are genotyping SVs.

SNP and indel filtering

See SNP and indel filtering page

The INFO/AAScore has a score between 0 and 1, higher score indicates a better alternative allele quality. Consider removing anything from your analysis with a score below 0.5, or find another threshold that suits your application.