# VCFTools.jl

VCFTools.jl provides Julia utilities for handling VCF files.

In [None]:
# display Julia version info
versioninfo()

## Example VCF file

The current folder contains an example VCF file for demonstration.

In [None]:
;ls -l test.vcf.gz

Load the VCF file and display its first 35 lines:

In [None]:
using VCFTools

fh = openvcf("test.vcf.gz", "r")
for l in 1:35
    println(readline(fh))
end
close(fh)

As in typical VCF files, it has a block of meta-information lines, one header line, and then one line for each marker. In this VCF, the genetic data has the fields GT (genotype), DS (dosage), and GL (genotype likelihood).
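To make the layout of a marker line concrete, here is a minimal sketch in plain Julia that splits one data line into its fixed fields and per-sample entries. The line itself is made up for illustration (it is not copied from `test.vcf.gz`), and the variable names are hypothetical; VCFTools.jl does this parsing internally through its own reader.

```julia
# A made-up, tab-separated VCF data line with two samples (illustration only)
vcf_line = join(["22", "20000086", "rs0000001", "T", "C", "100", "PASS", ".",
                 "GT:DS:GL",
                 "0/0:0.000:-0.03,-1.19,-5.00",
                 "0/1:1.000:-2.50,-0.01,-2.00"], '\t')

cols        = String.(split(vcf_line, '\t'))
fixed       = cols[1:8]                       # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
format_keys = String.(split(cols[9], ':'))    # keys describing each sample entry
sample_cols = cols[10:end]                    # one entry per sample

# pair every FORMAT key with the matching value of each sample
parsed = [Dict(zip(format_keys, String.(split(s, ':')))) for s in sample_cols]

println(parsed[2]["GT"])   # GT entry of the second sample
println(parsed[2]["DS"])   # DS entry of the second sample
```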

## Summary statistics

* Number of records (markers) in a VCF file.

In [None]:
records = nrecords("test.vcf.gz")

* Number of samples (individuals) in a VCF file.

In [None]:
samples = nsamples("test.vcf.gz")

* The `gtstats` function calculates genotype statistics for each marker with a GT field.

In [None]:
@time records, samples, lines, missings_by_sample, missings_by_record, 
    maf_by_record, minorallele_by_record = gtstats("test.vcf.gz");

In [None]:
# number of markers
records

In [None]:
# number of samples (individuals)
samples

In [None]:
# number of markers with GT field
lines

In [None]:
# number of missing genotypes in each sample (individual)
missings_by_sample'

In [None]:
# number of missing genotypes in each marker with GT field
missings_by_record'

In [None]:
# minor allele frequency of each marker with GT field
maf_by_record'

In [None]:
# minor allele of each marker (with GT field): true (REF) or false (ALT)
minorallele_by_record'

The optional second argument of the `gtstats` function specifies an output file or IO stream for the per-marker genotype statistics. Each line has the fields:
- 1-8:  VCF fixed fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO)
-   9:  Missing genotype count
-  10:  Missing genotype frequency
-  11:  ALT allele count
-  12:  ALT allele frequency
-  13:  Minor allele count             (REF allele vs ALT alleles)
-  14:  Minor allele frequency         (REF allele vs ALT alleles)
-  15:  HWE P-value                    (REF allele vs ALT alleles)
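As a rough illustration of what fields 9 to 14 mean, the sketch below computes them by hand for a single made-up marker. It assumes diploid genotypes coded as ALT allele counts and frequencies taken over the non-missing alleles; these are assumptions for the example, and the exact denominators used by `gtstats` are not stated here. The HWE P-value (field 15) is omitted.

```julia
# Hypothetical ALT allele counts (0, 1, or 2) for six samples at one marker;
# `missing` marks a missing genotype
alt_counts = [0, 1, 2, missing, 0, 1]

n_missing = count(ismissing, alt_counts)            # field 9: missing genotype count
miss_freq = n_missing / length(alt_counts)          # field 10: missing genotype frequency
n_alt     = sum(skipmissing(alt_counts))            # field 11: ALT allele count
n_alleles = 2 * (length(alt_counts) - n_missing)    # observed alleles (assumes diploid samples)
alt_freq  = n_alt / n_alleles                       # field 12: ALT allele frequency
n_minor   = min(n_alt, n_alleles - n_alt)           # field 13: minor allele count
maf       = n_minor / n_alleles                     # field 14: minor allele frequency

println((n_missing, miss_freq, n_alt, alt_freq, n_minor, maf))
```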

In [None]:
# write genotype statistics in file gtstats.out.txt
@time gtstats("test.vcf.gz", "gtstats.out.txt");

The output file can be read as a `DataFrame` for further analysis.

In [None]:
using CSV, DataFrames

gstat = CSV.read("gtstats.out.txt", DataFrame;
    header = [:chr, :pos, :id, :ref, :alt, :qual, :filt, :info, :missings, :missfreq, :nalt, :altfreq, :nminor, :maf, :hwe],
    delim = '\t',
)

## Filter

Sometimes we wish to subset an entire VCF file, for example to filter out certain samples or records (SNPs). This is achieved via the `VCFTools.filter` function:

In [None]:
# filtering by specifying indices to keep
record_mask = 1:records       # keep all records (SNPs)
sample_mask = 2:(samples - 1) # keep all but first and last sample (individual)
@time VCFTools.filter("test.vcf.gz", record_mask, sample_mask, 
    des="filtered.test.vcf.gz")

One can also supply `BitVector`s as masks:

In [None]:
record_mask    = trues(records)
sample_mask    = trues(samples)
record_mask[1] = record_mask[end] = false
@time VCFTools.filter("test.vcf.gz", record_mask, sample_mask, 
    des="filtered.test.vcf.gz")
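The two mask styles describe the same selection. The following plain-Julia sketch, with a hypothetical sample count, shows how an index range and a `BitVector` correspond; it does not call VCFTools.jl.

```julia
n_samples = 5                      # hypothetical number of samples

idx_mask = 2:(n_samples - 1)       # index form: keep samples 2 through n_samples - 1

bit_mask = falses(n_samples)       # BitVector form: start with nothing kept
bit_mask[idx_mask] .= true         # then mark the same samples as kept

println(findall(bit_mask))         # indices selected by the BitVector
println(collect(idx_mask))         # indices selected by the range
```

Index masks are convenient for contiguous blocks, while `BitVector`s are convenient when the selection comes from an elementwise condition.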

## Convert

Convert GT data in VCF file `test.vcf.gz` to a `Matrix{Union{Missing, Int8}}`. Here `as_minorallele = false` indicates that `VCFTools.jl` will copy the `0`s and `1`s of the file directly into `A`, without checking if ALT or REF is the minor allele. 

In [None]:
@time A = convert_gt(Int8, "test.vcf.gz"; as_minorallele = false, 
    model = :additive, impute = false, center = false, scale = false)

Convert GT data in VCF file `test.vcf.gz` to a numeric array. This checks which of `ALT/REF` is the minor allele, imputes the missing genotypes according to allele frequency, centers the dosages around 2MAF, and scales the dosages by `sqrt(2MAF*(1-MAF))`.

In [None]:
@time A = convert_gt(Float64, "test.vcf.gz"; as_minorallele = true, 
    model = :additive, impute = true, center = true, scale = true)
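The centering and scaling can be sketched by hand for one marker. The example below uses made-up minor allele counts and assumes that a missing genotype is replaced by its expected value 2MAF; this mean imputation is an assumption for illustration, since the exact imputation rule of `convert_gt` is not spelled out above.

```julia
# Hypothetical minor allele counts for six samples at one marker
geno = Union{Missing, Float64}[0, 1, 2, missing, 0, 1]

# minor allele frequency estimated from the observed genotypes
maf_hat = sum(skipmissing(geno)) / (2 * count(!ismissing, geno))

# impute (assumed rule): replace each missing genotype by 2MAF
geno_imputed = coalesce.(geno, 2maf_hat)

# center around 2MAF, then scale by sqrt(2MAF * (1 - MAF))
geno_std = (geno_imputed .- 2maf_hat) ./ sqrt(2maf_hat * (1 - maf_hat))

println(geno_std)
```

Under this assumption an imputed entry ends up at exactly zero after centering, so it does not pull the standardized marker in either direction.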

## Extract data marker-by-marker or window-by-window

Large VCF files can easily produce numeric arrays that are too large to fit in memory, yet many analyses only need to loop over markers or sets of markers. The functions for importing genotypes, haplotypes, and dosages have in-place counterparts for this purpose:

+ `copy_gt!` loops over genotypes
+ `copy_ht!` loops over haplotypes
+ `copy_ds!` loops over dosages

For example, to loop over all genotype markers in the VCF file `test.vcf.gz`:

In [None]:
using GeneticVariation

# initialize VCF reader
people, snps = nsamples("test.vcf.gz"), nrecords("test.vcf.gz")
reader = VCF.Reader(openvcf("test.vcf.gz"))
# pre-allocate vector for marker data
g = zeros(Union{Missing, Float64}, people)
for j = 1:snps
    copy_gt!(g, reader; model = :additive, impute = true, center = true, scale = true)
    # do statistical analysis
end
close(reader)

To loop over markers in windows of size 25:

In [None]:
# initialize VCF reader
people, snps = nsamples("test.vcf.gz"), nrecords("test.vcf.gz")
reader = VCF.Reader(openvcf("test.vcf.gz"))
# pre-allocate matrix for marker data
windowsize = 25
g = zeros(Union{Missing, Float64}, people, windowsize)
nwindows = ceil(Int, snps / windowsize)
for j = 1:nwindows
    copy_gt!(g, reader; model = :additive, 
        impute = true, center = true, scale = true)
    # do statistical analysis
end
close(reader)
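When the number of markers is not a multiple of the window size, the last window covers fewer markers than the others, so the trailing columns of `g` should not be treated as data in that iteration. The sketch below works out which markers fall into each window; the marker count is hypothetical and not the count in `test.vcf.gz`.

```julia
total_snps = 110     # hypothetical marker count
win_size   = 25
n_win      = ceil(Int, total_snps / win_size)   # same formula as above

# marker range covered by each window; the last one may be shorter
win_ranges = [((w - 1) * win_size + 1):min(w * win_size, total_snps) for w in 1:n_win]

for (w, r) in enumerate(win_ranges)
    println("window $w: markers $(first(r))-$(last(r)) ($(length(r)) markers)")
end
```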