# VCF summary information

## Example VCF file

We need an example VCF file for demonstration. You can download it manually from [link](http://faculty.washington.edu/browning/beagle/test.08Jun17.d8b.vcf.gz) (877KB) and put it in your current working directory, or download it from within Julia:

In [1]:
isfile("test.08Jun17.d8b.vcf.gz") || download("http://faculty.washington.edu/browning/beagle/test.08Jun17.d8b.vcf.gz", 
    joinpath(pwd(), "test.08Jun17.d8b.vcf.gz"))
stat("test.08Jun17.d8b.vcf.gz")

StatStruct(mode=0o100644, size=876514)

The first 35 lines of the VCF file are:

In [2]:
using VCFTools

fh = openvcf("test.08Jun17.d8b.vcf.gz", "r")
for l in 1:35
    println(readline(fh))
end
close(fh)

##fileformat=VCFv4.1
##INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">
##INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">
##INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">
##INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">
##INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Seque

As in typical VCF files, it has a number of meta-information lines, one header line, and then one line for each marker. In this VCF, the genotype data has the fields GT (genotype), DS (dosage), and GL (genotype likelihood).
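To make the per-marker layout concrete, here is a minimal sketch in plain Julia (no packages) that splits one data line into its fixed fields, FORMAT keys, and per-sample values. The line itself is a hypothetical, shortened record with two samples and made-up DS and GL values; it is not copied from the example file.

```julia
# Hypothetical, abbreviated VCF data line with two samples (fields are tab-separated).
line = join(["22", "20000086", "rs138720731", "T", "C", "100", "PASS", ".",
             "GT:DS:GL",
             "0|0:0.000:-0.01,-1.60,-5.00",
             "0|1:1.000:-3.00,-0.00,-3.00"], '\t')

fields = String.(split(line, '\t'))
fixed_fields = fields[1:8]                     # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
format_keys = String.(split(fields[9], ':'))   # FORMAT column, here GT, DS, GL
sample_columns = fields[10:end]                # one column per sample

# Pair each FORMAT key with the matching value for every sample.
parsed = [Dict(zip(format_keys, String.(split(col, ':')))) for col in sample_columns]

for (i, sample) in enumerate(parsed)
    println("sample ", i, ": GT=", sample["GT"], " DS=", sample["DS"], " GL=", sample["GL"])
end
```

Every data line carries the same eight fixed fields, followed by the FORMAT column and one column per sample, so the number of sample columns equals the sample count.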

## Summary statistics

* Number of records (markers) in a VCF file.

In [3]:
records = nrecords("test.08Jun17.d8b.vcf.gz")

1356

* Number of samples (individuals) in a VCF file.

In [4]:
samples = nsamples("test.08Jun17.d8b.vcf.gz")

191

* The `gtstats` function calculates genotype statistics for each marker with a GT field.

In [5]:
@time records, samples, lines, missings_by_sample, missings_by_record, 
    maf_by_record, minorallele_by_record = gtstats("test.08Jun17.d8b.vcf.gz");

  1.513952 seconds (6.10 M allocations: 395.394 MiB, 5.55% gc time)


In [6]:
# number of markers
records

1356

In [7]:
# number of samples (individuals)
samples

191

In [8]:
# number of markers with GT field
lines

1356

In [9]:
# number of missing genotypes in each sample (individual)
missings_by_sample'

1×191 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0

In [10]:
# number of missing genotypes in each marker with GT field
missings_by_record'

1×1356 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0

In [11]:
# minor allele frequency of each marker with GT field
maf_by_record'

1×1356 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 0.0  0.0  0.0  0.0  0.146597  0.0  …  0.0  0.0  0.0706806  0.0706806  0.0

In [12]:
# minor allele of each marker (with GT field): true (REF) or false (ALT)
minorallele_by_record'

1×1356 LinearAlgebra.Adjoint{Bool,Array{Bool,1}}:
 1  1  1  1  1  1  1  1  1  1  1  1  0  …  1  1  1  1  1  1  1  1  1  1  1  1
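The returned vectors are ordinary Julia arrays, so they can be summarized or used for filtering directly. The sketch below uses a short hypothetical vector in place of `maf_by_record`, and the 0.05 threshold is only an illustrative choice.

```julia
# Hypothetical MAF values standing in for `maf_by_record`.
maf = [0.0, 0.0, 0.146597, 0.0706806, 0.0]

# Markers where no minor allele was observed (MAF exactly zero).
monomorphic = count(iszero, maf)

# Indices of markers whose MAF is at least the (assumed) threshold of 0.05.
common = findall(x -> x >= 0.05, maf)

println("monomorphic markers: ", monomorphic)
println("markers with MAF >= 0.05: ", common)
```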

The optional second argument of the `gtstats` function specifies an output file or IO stream for the per-marker genotype statistics. Each line has the fields:  
- 1-8:  VCF fixed fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO)
-   9:  Missing genotype count
-  10:  Missing genotype frequency
-  11:  ALT allele count
-  12:  ALT allele frequency
-  13:  Minor allele count             (REF allele vs ALT alleles)
-  14:  Minor allele frequency         (REF allele vs ALT alleles)
-  15:  HWE P-value                    (REF allele vs ALT alleles)
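
As a rough illustration of how columns 9-14 relate to one another, the following sketch computes them from a vector of GT strings for a single marker. `genotype_summary` is a hypothetical helper, not part of VCFTools, and its conventions are assumptions: a genotype with any `.` allele counts as missing, and any non-`0` allele counts as ALT. `gtstats` may treat such cases differently.

```julia
# Hypothetical helper: summarize the GT strings of one marker.
function genotype_summary(gts::AbstractVector{<:AbstractString})
    nmissing = 0   # genotypes containing a missing allele
    nalt = 0       # ALT alleles among non-missing genotypes
    ntotal = 0     # all alleles among non-missing genotypes
    for gt in gts
        alleles = split(gt, ['|', '/'])        # phased or unphased separator
        if any(a -> a == ".", alleles)
            nmissing += 1
            continue
        end
        ntotal += length(alleles)
        nalt += count(a -> a != "0", alleles)
    end
    nminor = min(nalt, ntotal - nalt)          # REF allele vs ALT alleles
    return (missings = nmissing,
            missfreq = nmissing / length(gts),
            nalt     = nalt,
            altfreq  = nalt / ntotal,          # NaN if every genotype is missing
            nminor   = nminor,
            maf      = nminor / ntotal)
end

gts = ["0|0", "0|1", "1|1", "./.", "0|0"]
stats = genotype_summary(gts)
println(stats)
```

In this toy input one of five genotypes is missing, and three of the eight remaining alleles are ALT, so the ALT allele is also the minor allele.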

In [13]:
# write genotype statistics in file gtstats.out.txt
@time gtstats("test.08Jun17.d8b.vcf.gz", "gtstats.out.txt");

  0.399222 seconds (1.82 M allocations: 180.814 MiB, 5.19% gc time)


The output file can be read as a `DataFrame` for further analysis.

In [14]:
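using CSV, DataFrames

# recent CSV.jl versions require the sink type (here `DataFrame`) as the second argument
gstat = CSV.read("gtstats.out.txt", DataFrame;
    header = [:chr, :pos, :id, :ref, :alt, :qual, :filt, :info, :missings, :missfreq, :nalt, :altfreq, :nminor, :maf, :hwe],
    delim = '\t',
)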

| Row | chr | pos | id | ref | alt | qual | filt | info |
|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | String | String | String | Float64 | String | String |
| 1 | 22 | 20000086 | rs138720731 | T | C | 100.0 | PASS | AC=7;RSQ=0.8454;AVGPOST=0.9983;AA=T;AN=2184;LDAF=0.0040;THETA=0.0001;VT=SNP;SNPSOURCE=LOWCOV;ERATE=0.0003;AF=0.0032;AFR_AF=0.01 |
| 2 | 22 | 20000146 | rs73387790 | G | A | 100.0 | PASS | LDAF=0.0169;RSQ=0.9482;THETA=0.0004;AA=G;AN=2184;AVGPOST=0.9972;VT=SNP;SNPSOURCE=LOWCOV;AC=36;ERATE=0.0003;AF=0.02;AFR_AF=0.07;EUR_AF=0.0013 |
| 3 | 22 | 20000199 | rs183293480 | A | C | 100.0 | PASS | LDAF=0.0009;THETA=0.0004;AN=2184;AVGPOST=0.9990;VT=SNP;AA=A;RSQ=0.6274;SNPSOURCE=LOWCOV;AC=1;ERATE=0.0003;AF=0.0005;EUR_AF=0.0013 |
| 4 | 22 | 20000291 | rs185807825 | G | T | 100.0 | PASS | ERATE=0.0005;AVGPOST=0.9983;AA=G;AN=2184;LDAF=0.0015;VT=SNP;SNPSOURCE=LOWCOV;RSQ=0.5564;AC=2;THETA=0.0003;AF=0.0009;ASN_AF=0.0035 |
| 5 | 22 | 20000428 | rs55902548 | G | T | 100.0 | PASS | AC=323;AVGPOST=0.9983;AA=G;AN=2184;VT=SNP;RSQ=0.9949;LDAF=0.1473;SNPSOURCE=LOWCOV;ERATE=0.0003;THETA=0.0003;AF=0.15;ASN_AF=0.0017;AMR_AF=0.15;AFR_AF=0.31;EUR_AF=0.15 |
| 6 | 22 | 20000683 | rs142720028 | A | G | 100.0 | PASS | AVGPOST=0.9985;AN=2184;LDAF=0.0015;VT=SNP;RSQ=0.5718;AA=A;SNPSOURCE=LOWCOV;THETA=0.0007;ERATE=0.0003;AC=2;AF=0.0009;AFR_AF=0.0041 |
| 7 | 22 | 20000771 | rs114690707 | A | C | 100.0 | PASS | ERATE=0.0004;AC=28;AN=2184;RSQ=0.9857;VT=SNP;AA=A;LDAF=0.0130;SNPSOURCE=LOWCOV;AVGPOST=0.9995;THETA=0.0003;AF=0.01;AMR_AF=0.01;AFR_AF=0.05 |
| 8 | 22 | 20000793 | rs189842693 | T | C | 100.0 | PASS | ERATE=0.0004;RSQ=0.7411;AA=T;AN=2184;AVGPOST=0.9981;AC=6;VT=SNP;SNPSOURCE=LOWCOV;LDAF=0.0031;THETA=0.0003;AF=0.0027;ASN_AF=0.0035;EUR_AF=0.01 |
| 9 | 22 | 20000810 | rs147349046 | C | T | 100.0 | PASS | AA=C;AVGPOST=0.9994;AC=28;AN=2184;VT=SNP;RSQ=0.9802;SNPSOURCE=LOWCOV;ERATE=0.0003;LDAF=0.0128;THETA=0.0003;AF=0.01;AMR_AF=0.01;AFR_AF=0.05 |
| 10 | 22 | 20000814 | rs183154520 | T | C | 100.0 | PASS | ERATE=0.0004;AVGPOST=0.9985;THETA=0.0002;AA=T;AN=2184;RSQ=0.4507;VT=SNP;SNPSOURCE=LOWCOV;AC=1;LDAF=0.0012;AF=0.0005;AMR_AF=0.0028 |
