# Genotype VCF file quality control

This implements some recommendations from UK Biobank on [exome sequence data quality control](https://www.medrxiv.org/content/10.1101/2020.11.02.20222232v1.full-text).

## Overview

The goal of this module is to perform QC on VCF files, including:

1. Handling the formatting of multi-allelic sites.
2. Genotype- and variant-level filtering based on genotype calling qualities.
3. Annotation of known and novel variants.
4. Summary statistics before and after QC, in particular the ts/tv ratio, to assess the effectiveness of QC.

Items 3 and 4 above are for exploratory analysis of the overall quality of genotype data in the VCF files. We annotate known and novel variants because the ts/tv ratio is expected to differ between known and novel variants, and it is an important metric for assessing the effectiveness of our QC.

### Multi-allelic sites

Multi-allelic sites can be problematic in many ways for downstream analysis, even if their formatting is handled properly after QC. We provide an optional workflow module to keep only bi-allelic sites, although by default we include multi-allelic sites in the VCF file we generate.
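
The two ways of handling multi-allelic sites can be illustrated with a minimal Python sketch. It only mimics the idea behind the `bcftools norm -m-any` (split) and `bcftools view -m2 -M2` (keep bi-allelic only) calls used in `qc_1`; the `Record` tuple and both helper functions are hypothetical, and genotype recoding and left-normalization are deliberately left out.

```python
from typing import List, Tuple

# Hypothetical minimal record: (CHROM, POS, REF, comma-separated ALT)
Record = Tuple[str, int, str, str]


def split_multiallelic(record: Record) -> List[Record]:
    """Split one record with several ALT alleles into one record per ALT allele."""
    chrom, pos, ref, alt = record
    return [(chrom, pos, ref, a) for a in alt.split(",")]


def is_biallelic(record: Record) -> bool:
    """True when the site has exactly one ALT allele (REF plus one ALT)."""
    alt = record[3]
    return alt != "." and "," not in alt


records = [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T,G")]

# Default behaviour: keep every site, one ALT allele per line
split_records = [r for rec in records for r in split_multiallelic(rec)]

# Optional behaviour: drop sites with more than one ALT allele
biallelic_only = [rec for rec in records if is_biallelic(rec)]

print(split_records)
print(biallelic_only)
```

Splitting keeps the information at the site but produces several lines with the same position, which is why some downstream tools still need special handling.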

## Default VCF QC filters


1. Genotype depth filters: DP>10 for SNPs and DP>10 for indels.
2. At least one sample per site passes the allele balance threshold: >=0.15 for SNPs and >=0.20 for indels (heterozygous variants). 
    - Allele balance is calculated for heterozygotes as the number of bases supporting the least-represented allele over the total number of base observations.
3. Genotype quality GQ>20.

Filtering is done with `bcftools`. Here is a [useful cheatsheet from github user @elowy01](https://gist.github.com/elowy01/93922762e131d7abd3c7e8e166a74a0b).
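
As a rough illustration of the allele balance definition above, the following Python sketch computes the balance of one heterozygous call from its allelic depths and checks whether at least one sample at a site reaches a threshold. The function names and the example depths are made up for illustration. Note that the sketch follows the prose definition (least-represented allele over total); the `bcftools` expression in `qc_2` uses the ALT read fraction `AD[1]/(AD[0]+AD[1])` instead.

```python
from typing import Iterable, Sequence


def allele_balance(ad: Sequence[int]) -> float:
    """Reads supporting the least-represented allele divided by all reads."""
    total = sum(ad)
    if total == 0:
        return 0.0  # no reads observed: treat as failing any positive threshold
    return min(ad) / total


def site_passes_ab(het_ads: Iterable[Sequence[int]], threshold: float) -> bool:
    """True if at least one heterozygous sample reaches the threshold."""
    return any(allele_balance(ad) >= threshold for ad in het_ads)


# Hypothetical (ref_reads, alt_reads) pairs for heterozygous samples at one SNP
het_depths = [(19, 1), (12, 8)]
print([round(allele_balance(ad), 2) for ad in het_depths])
print(site_passes_ab(het_depths, threshold=0.15))
```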

## A note on TS/TV summary from VCF genotype data

The `bcftools stats` command provides useful summary statistics, including the TS/TV ratio, which is routinely used as a quality measure of variant calls. With dbSNP-based annotation of novel and known variants, `bcftools` can compute TS/TV for novel and known variants at the variant level and at the sample level. Note that variant-level TS/TV does not take sample genotypes into consideration: it simply counts the TS and TV events among the SNPs observed in the data. Other tools, such as `snpsift`, implement variant-level TS/TV by counting TS and TV events in sample genotypes and computing the ratio after summing TS and TV across all samples. See [this discussion](https://github.com/samtools/bcftools/issues/1526) of the issue. We provide both TS/TV calculations before and after QC, but users should be aware of the difference when interpreting the results.
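
The difference between the two counting schemes can be seen on a toy example. The Python sketch below is a simplified illustration, not the exact algorithm of either tool: the input layout `(REF, ALT, number of samples carrying the ALT allele)` is an assumption, and so is counting each carrier once.

```python
from typing import List, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Hypothetical SNP summary: (REF, ALT, number of samples carrying ALT)
Snp = Tuple[str, str, int]


def is_transition(ref: str, alt: str) -> bool:
    """A transition stays within purines (A<->G) or within pyrimidines (C<->T)."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ratio(ts: int, tv: int) -> float:
    return ts / tv if tv else float("nan")


def tstv_variant_level(snps: List[Snp]) -> float:
    """Each observed SNP contributes one event, whatever the genotypes are."""
    ts = sum(1 for ref, alt, _ in snps if is_transition(ref, alt))
    return ratio(ts, len(snps) - ts)


def tstv_genotype_level(snps: List[Snp]) -> float:
    """Each SNP contributes one event per carrier, summed over all samples."""
    ts = sum(n for ref, alt, n in snps if is_transition(ref, alt))
    tv = sum(n for ref, alt, n in snps if not is_transition(ref, alt))
    return ratio(ts, tv)


snps = [("A", "G", 1), ("C", "T", 1), ("A", "C", 10)]
print(tstv_variant_level(snps))   # 2 transitions / 1 transversion
print(tstv_genotype_level(snps))  # 2 transition carriers / 10 transversion carriers
```

With a common transversion and two rare transitions, the same data give very different ratios under the two schemes, which is why the two numbers should not be compared directly.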

## Input

1. The target `VCF` file
    - If its chromosome names lack the `chr` prefix and you need them to match the reference `fasta` file, run the `rename_chrs` workflow to add the prefix.
2. dbSNP database in `VCF` format
3. A reference sequence `fasta` file

## Output
1. QC-ed genotype data in VCF and in PLINK format
2. A set of sumstats to help evaluate quality of genotype before and after QC
    - Particularly useful is the TS/TV ratio

## Minimal working example
The MWE is generated via 
```
# Keep the first 40 samples
bcftools query -l get-dosage.ALL.vcf.gz | head -40 > MWE_sample_list
# Run in the foreground so the file is complete before it is compressed
bcftools view -S MWE_sample_list get-dosage.ALL.vcf.gz > sample_filtered.vcf
bgzip -c sample_filtered.vcf > sample_filtered.vcf.gz
tabix -p vcf sample_filtered.vcf.gz
# Keep chr1 only, then the first 20000 lines (header included)
bcftools view --regions chr1 sample_filtered.vcf.gz > chr1_sample_filtered.vcf
head -20000 chr1_sample_filtered.vcf > MWE_genotype.vcf
```
and was stored here: https://drive.google.com/file/d/1sxxPdPIyKma0mAl8TKwhgyRHlOh0Oyrc/view?usp=sharing

The MWE was used as follows:

```
sos run VCF_QC.ipynb rename_chrs \
    --genoFile reference_data/00-All.vcf.gz \
    --cwd reference_data --container ./bioinfo.sif
```

```
sos run VCF_QC.ipynb dbsnp_annotate \
    --genoFile reference_data/00-All.add_chr.vcf.gz \
    --cwd reference_data --container ./bioinfo.sif
```


```
sos run VCF_QC.ipynb qc    \
--genoFile data/MWE/MWE_genotype.vcf     \
--dbsnp-variants data/reference_data/00-All.add_chr.variants.gz  \
--reference-genome data/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta   \
--cwd MWE/output/genotype_1 --container ./bioinfo.sif -J 1 -c csg.yml -q csg  &
```
To produce the following results:

- Total TS/TV for 19639 known variants before QC: 2.599
- Total TS/TV for 19573 known variants after QC: 2.600
- There are no novel variants in the MWE.

The total TS/TV is extracted from the output of the last QC step. For known variants before QC:

In [13]:
grep Ts/Tv MWE_genotype.leftnorm.known_variant.snipsift_tstv | rev | cut -d',' -f1 | rev

2.599


For known variants after QC:

In [4]:
grep Ts/Tv MWE_genotype.leftnorm.filtered.*_variant.snipsift_tstv | rev | cut -d',' -f1 | rev

2.600


For novel variants before/after QC, TS/TV is not available since no novel variants are present in the MWE:

In [None]:
grep Ts/Tv MWE_genotype.leftnorm.novel_variant.snipsift_tstv | rev | cut -d',' -f1 | rev
grep Ts/Tv MWE_genotype.leftnorm.filtered.novel_variant.snipsift_tstv | rev | cut -d',' -f1 | rev
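
The `grep ... | rev | cut -d',' -f1 | rev` pipelines above keep the last comma-separated field of every line containing `Ts/Tv`. A rough Python equivalent is sketched below; the example lines are hypothetical and only stand in for the layout of the `snipsift_tstv` files, and unlike `cut` the sketch also strips surrounding whitespace.

```python
from typing import Iterable, List


def last_csv_field(line: str) -> str:
    """Return the last comma-separated field of a line, without whitespace."""
    return line.rstrip("\n").split(",")[-1].strip()


def extract_tstv(lines: Iterable[str]) -> List[str]:
    """Mimic `grep Ts/Tv | rev | cut -d',' -f1 | rev` on an iterable of lines."""
    return [last_csv_field(line) for line in lines if "Ts/Tv" in line]


# Hypothetical file content, for illustration only
example_lines = [
    "Sample,S1,S2,Total\n",
    "Ts/Tv,2.412,2.731,2.599\n",
]
print(extract_tstv(example_lines))
```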

## Command Interface

In [1]:
sos run VCF_QC.ipynb -h

usage: sos run VCF_QC.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  rename_chrs
  dbsnp_annotate
  qc

Global Workflow Options:
  --genoFile VAL (as path, required)
                        input
  --cwd VAL (as path, required)
                        Workdir
  --numThreads 1 (as int)
                        Number of threads
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Walltime
  --mem 60G
  --container ''
                        Software container option

Sections
  rename_chrs:
  dbsnp_annotate:
  qc_1:                 Handle multi-allelic sites, left normalization of indels
         

## Global parameters

In [1]:
[global]
# input
parameter: genoFile = path
# Workdir
parameter: cwd = path
# Number of threads
parameter: numThreads = 1
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Walltime 
parameter: walltime = '5h'
parameter: mem = '60G'
# Software container option
parameter: container = ""
# use this function to edit memory string for PLINK input
from sos.utils import expand_size
cwd = path(f"{cwd:a}")

## Annotation of known and novel variants

The known variant reference can be downloaded from https://ftp.ncbi.nlm.nih.gov/snp/organisms//human_9606_b150_GRCh38p7/VCF/00-All.vcf.gz.

The procedure/rationale is [explained in this post](https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionVI/lessons/03_annotation-snpeff.html).

It takes ~1hr for `rename_chrs` to complete.

In [None]:
[rename_chrs]
# This file can be downloaded from https://ftp.ncbi.nlm.nih.gov/snp/organisms//human_9606_b150_GRCh38p7/VCF/00-All.vcf.gz.
input: genoFile
output: f"{cwd}/{_input:bnn}.add_chr.vcf.gz"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    for i in {1..22} X Y MT; do echo "$i chr$i"; done > ${_output:nn}.chr_name_conv.txt
    bcftools annotate --rename-chrs ${_output:nn}.chr_name_conv.txt ${_input} -Oz -o ${_output}
    tabix -p vcf ${_output}
    rm -f ${_output:nn}.chr_name_conv.txt
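
The conversion table written by the `for` loop in `rename_chrs` has one `old new` pair per line, which is the layout passed to `bcftools annotate --rename-chrs`. As a small sketch, the same table can be built in Python; the function names are hypothetical and the chromosome list simply mirrors the loop above.

```python
from typing import Dict, List


def chr_name_map() -> Dict[str, str]:
    """Map 1..22, X, Y, MT to the same names prefixed with 'chr'."""
    names = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]
    return {name: f"chr{name}" for name in names}


def to_conv_lines(mapping: Dict[str, str]) -> List[str]:
    """Format the mapping as whitespace-separated 'old new' lines."""
    return [f"{old} {new}" for old, new in mapping.items()]


lines = to_conv_lines(chr_name_map())
print(lines[0], "...", lines[-1])
```

Contigs not listed in the table keep their original names, so it is worth checking that the resulting names match those in the reference `fasta` file.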

In [None]:
[dbsnp_annotate]
# VCF file with chromosome names already matching the target data
input: genoFile
output: f"{_input:nn}.variants.gz"
task: trunk_workers = 1, trunk_size=5, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    bcftools query  -f'%CHROM\t%POS\t%ID\t%REF\t%ALT\n' ${_input}  | \
        awk 'BEGIN{OFS="\t";} {if (length ($4) > length ($5)) {print $1,$2,$2+ (length ($4) - 1),$3} else {print $1,$2, $2 + (length ($4) -1 ),$3}}' | \
        bgzip -c > ${_output}
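
The `awk` command in `dbsnp_annotate` turns every dbSNP record into a `CHROM, FROM, TO, ID` row, where the interval spans the reference allele. A minimal Python sketch of that arithmetic follows; the function name and the example records are hypothetical.

```python
from typing import Tuple


def variant_interval(chrom: str, pos: int, rsid: str, ref: str) -> Tuple[str, int, int, str]:
    """Return (CHROM, FROM, TO, ID) with TO = POS + len(REF) - 1."""
    return (chrom, pos, pos + len(ref) - 1, rsid)


# A SNP covers a single base; a 3-base reference allele covers three positions
print(variant_interval("chr1", 100, "rs0000001", "A"))
print(variant_interval("chr1", 200, "rs0000002", "ACG"))
```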

## Genotype QC

This step handles multi-allelic sites and annotates variants as known or novel. We add an rsID to variants found in dbSNP. Variants without an rsID are considered novel.

In [4]:
# Handle multi-allelic sites, left normalization of indels and add variant ID
[qc_1 (variant preprocessing)]
# Path to dbSNP variants generated previously
parameter: dbsnp_variants = path
# Path to fasta file for HG reference genome, eg GRCh38_full_analysis_set_plus_decoy_hla.fa
parameter: reference_genome = path
parameter: bi_allelic = False
parameter: snp_only = False
input: genoFile, group_by = 1
output: f'{cwd}/{_input:bnn}.{"leftnorm" if not bi_allelic else "biallelic"}{".snp" if snp_only else ""}.vcf.gz'
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
        ${'bcftools norm -m-any' if not bi_allelic else 'bcftools view -m2 -M2'} ${'-v snps' if snp_only else ""} ${_input} |\
        bcftools norm --check-ref w -f ${reference_genome}  -Oz|\
        bcftools +fill-tags -- -t all,F_MISSING,'VD=sum(DP)' | \
        bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \
        bcftools annotate -a ${dbsnp_variants}  -h <(echo '##INFO=<ID=RSID,Number=1,Type=String,Description="dbSNP rsID">') -c CHROM,FROM,TO,ID -Oz > ${_output}

This step filters variants based on FILTER PASS, DP and GQ, the fraction of missing genotypes (across all samples), and HWE, for SNPs and indels. It also removes monomorphic sites using `bcftools view -c1`.

In [3]:
# genotype QC
[qc_2 (variant level QC)]
# Maximum missingness per variant
parameter: geno_filter = 0.1
# Sample level QC - read depth (DP) to filter out SNPs below this value
parameter: DP_snp = 10
# Sample level QC - genotype quality (GQ) of specific sample. This measure tells you how confident we are that the genotype we assigned to a particular sample is correct
parameter: GQ = 20
# Sample level QC - read depth (DP) to filter out indels below this value
parameter: DP_indel = 10
# Allele balance for snps
parameter: AB_snp = 0.15
# Allele balance for indels
parameter: AB_indel = 0.2
# HWE filter 
parameter: hwe_filter = 1e-06
output: f"{_input:nn}.filtered.vcf.gz"
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    bcftools filter -S . -e '(TYPE="SNP" & (FMT/DP)<${DP_snp} & (FMT/GQ)<${GQ})|(TYPE="INDEL" & (FMT/DP)<${DP_indel} & (FMT/GQ)<${GQ})' ${_input} | \
    bcftools view -c1  | bcftools view -f PASS | \
    bcftools filter -i 'GT="hom" | TYPE="snp" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= ${AB_snp} | TYPE="indel" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= ${AB_indel}' | \
    bcftools filter -i 'F_MISSING<${geno_filter} & HWE>${hwe_filter}' -Oz -o ${_output} 
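
Two of the variant-level criteria in `qc_2` are easy to reason about on a toy example: the monomorphic-site check (`bcftools view -c1`) and the missingness check (`F_MISSING<${geno_filter}`). The Python sketch below assumes bi-allelic genotypes encoded as tuples of allele indices, with `None` for a missing call; this encoding and the function names are hypothetical, and the FILTER, allele balance and HWE criteria are not covered.

```python
from typing import List, Optional, Tuple

# Hypothetical genotype encoding: (0, 1) for 0/1, None for a missing call
Genotype = Optional[Tuple[int, int]]


def fraction_missing(genotypes: List[Genotype]) -> float:
    """Fraction of samples without a genotype call."""
    if not genotypes:
        return 1.0
    return sum(gt is None for gt in genotypes) / len(genotypes)


def alt_allele_count(genotypes: List[Genotype]) -> int:
    """Number of non-reference alleles among called genotypes."""
    return sum(sum(1 for allele in gt if allele != 0) for gt in genotypes if gt is not None)


def keep_variant(genotypes: List[Genotype], geno_filter: float = 0.1) -> bool:
    """Keep polymorphic sites whose missingness is below the threshold."""
    return alt_allele_count(genotypes) >= 1 and fraction_missing(genotypes) < geno_filter


site = [(0, 1)] + [(0, 0)] * 18 + [None]  # 20 samples, one carrier, one missing
print(fraction_missing(site), alt_allele_count(site), keep_variant(site))
```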

Finally we export the data to PLINK 1.0 format, **without keeping allele orders**. Note that PLINK 1.0 format does not allow for dosages. PLINK 2.0 format supports them, but it is generally not supported by downstream data analysis tools.

In the following code block the option `--vcf-half-call m` treats half-calls as missing.

Also, `--keep-allele-order` is intentionally not applied. The resulting PLINK files lose ref/alt allele information and instead go by major/minor allele, as conventionally used in standard PLINK format.
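
To make the consequence of dropping `--keep-allele-order` concrete, here is a small Python sketch of the major/minor convention: the first allele (A1) is the less frequent one, regardless of which allele was REF in the VCF. The function name is hypothetical, and the handling of an exact 0.5 frequency is an assumption of this sketch rather than documented PLINK behaviour.

```python
from typing import Tuple


def plink_a1_a2(ref: str, alt: str, alt_freq: float) -> Tuple[str, str]:
    """Return (A1, A2), with A1 the minor allele and A2 the major allele."""
    if alt_freq > 0.5:
        # ALT is the more common allele, so REF becomes A1 (minor)
        return ref, alt
    # Assumption: ties at exactly 0.5 keep ALT as A1
    return alt, ref


print(plink_a1_a2("A", "G", alt_freq=0.2))  # ALT is minor
print(plink_a1_a2("A", "G", alt_freq=0.9))  # REF is minor
```

Analyses that need effect alleles aligned to the VCF ALT allele therefore have to check allele coding explicitly after this export.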

In [None]:
[qc_3 (export to PLINK)]
output: f'{_input:nn}.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    plink --vcf ${_input} \
        --vcf-half-call m \
        --vcf-require-gt \
        --allow-extra-chr \
        --make-bed --out ${_output:n}

In [None]:
[qc_4 (genotype data summary statistics)]
input: output_from('qc_1'), output_from('qc_2'), group_by = 1
output: f"{cwd}/{_input:bnn}.novel_variant_sumstats", 
        f"{cwd}/{_input:bnn}.known_variant_sumstats", 
        f"{cwd}/{_input:bnn}.novel_variant.snipsift_tstv",
        f"{cwd}/{_input:bnn}.known_variant.snipsift_tstv"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    # Compute summary statistics, including TS/TV
    bcftools stats -i 'ID="."' -v  ${_input} > ${_output[0]}
    bcftools stats -i 'ID!="."' -v  ${_input} > ${_output[1]}
    bcftools filter -i 'ID="."'  ${_input}   | java -jar /opt/snpEff/SnpSift.jar tstv - > ${_output[2]}
    bcftools filter -i 'ID!="."' ${_input}  | java -jar /opt/snpEff/SnpSift.jar tstv - > ${_output[3]}