# Easy Manipulation of Genetic Variant Data

For analysis of genetic variants, single nucleotide polymorphism (SNP) information is widely used. A SNP corresponds to a nucleotide position on the genome where some degree of variation has been observed in a population, with each individual having one of two possible alleles at that position on each of a pair of chromosomes. The two alleles are often distinguished as a reference allele (REF, the allele on the [reference genome](https://en.wikipedia.org/wiki/Reference_genome)) and an alternate allele (ALT). The most widely used repository for SNP information is [dbSNP](https://www.ncbi.nlm.nih.gov/snp/), where each SNP is indexed by an identifier beginning with `rs` ("rsID").

The _genotype_ of each sample at each variant is commonly coded as follows:

| value | genotype |
|:---:|:---:|
| 0 | homozygous REF |
| 1 | heterozygous REF/ALT |
| 2 | homozygous ALT |

Sometimes, when the genotype is uncertain (for example, after imputation), it is represented as a _dosage_. Given the posterior probabilities of each genotype, the dosage is computed as
$$\mathrm{dosage = 0 \cdot Prob(REF/REF) + 1 \cdot Prob(REF/ALT) + 2 \cdot Prob(ALT/ALT)}$$
and takes values in $[0, 2]$.
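As a minimal sketch of the formula above, the dosage can be computed directly from the three posterior probabilities. The function name `dosage` and the example probabilities are made up for illustration and are not part of any package used in this workshop.

```julia
# Expected ALT allele count given posterior genotype probabilities.
# `dosage` is a hypothetical helper name, not a function from any package used here.
function dosage(p_refref::Real, p_refalt::Real, p_altalt::Real)
    total = p_refref + p_refalt + p_altalt
    isapprox(total, 1.0; atol = 1e-6) || throw(ArgumentError("probabilities must sum to 1"))
    return 0 * p_refref + 1 * p_refalt + 2 * p_altalt
end

d_confident = dosage(0.0, 1.0, 0.0)   # a certain heterozygote
d_uncertain = dosage(0.2, 0.5, 0.3)   # an uncertain call (toy numbers)
```

A confidently called heterozygote gives a dosage of exactly 1, while an uncertain call gives a non-integer value between 0 and 2.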



Often, the data formats for genetic variants include
- Information on $m$ samples (identifier, sex, phenotypes, etc.)
- Information on $n$ SNPs (identifier, chromosome, position, REF/ALT alleles, etc.)
- An $m \times n$ table or matrix containing the observed allelic type at $n$ positions for $m$ individuals.

A common type of analysis for this data is [genome-wide association studies (GWAS)](https://en.wikipedia.org/wiki/Genome-wide_association_study), often testing a statistical hypothesis variant by variant. Significance of each SNP is assessed by some type of regression:
$$
\mathrm{trait \sim SNP + age + sex + principal\;components + other\;covariates}
$$
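To make the regression above concrete, here is a rough sketch of the per-variant fit on toy data. All numbers are made up, the trait is noise-free so the coefficients can be checked by eye, and a real GWAS would also compute a test statistic and p-value for the SNP term, which this sketch omits.

```julia
using LinearAlgebra

# Toy data (made up for illustration): 6 samples, one SNP, one covariate.
g   = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]        # additive genotype coding
age = [30.0, 41.0, 52.0, 47.0, 35.0, 60.0]  # covariate
# Noise-free trait so that the fitted coefficients are easy to verify.
trait = 1.0 .+ 0.5 .* g .+ 0.1 .* age

# Design matrix: intercept, SNP, covariate.
X = hcat(ones(length(g)), g, age)
beta = X \ trait       # least-squares fit
snp_effect = beta[2]   # estimated effect of the SNP on the trait
```

In a variant-by-variant analysis, only the SNP column of `X` changes from one variant to the next, which is why the file formats below are discussed in terms of fast access to one variant at a time.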

In this workshop, we learn four widely-used file types for genetic variants and how to manipulate them in Julia.  



We focus on accessing genotype information variant by variant, as this is a common workflow in GWAS-based applications. We learn how to do this and compute some simple properties of SNPs, such as minor allele frequencies (MAF).

## Genetic Variant File Formats


### Variant Call Format (`.vcf`)
A text-based format is the most intuitive way to represent genetic variants. VCF is the most flexible format for including diverse information on the samples and variants.


- __Pros__: Highly flexible, easy to parse and interpret
- __Cons__: Large file size

With the scale of recent genetic data, where data sets such as UK Biobank have around half a million subjects and millions of variants, the storage needed for a raw VCF file is prohibitive. A common remedy is to store the data in compressed form (such as a `.gz` file) and decompress it as a stream at analysis time. However, there is a trade-off between stored file size and the time needed for decompression.

- __Pros__: Reasonable file size with flexible representation
- __Cons__: Takes too long to decompress





#### Basic usage: `VCFTools.jl`

In [1]:
using VCFTools

fh = openvcf("test_vcf.vcf.gz", "r")
for l in 1:35
    println(readline(fh))
end
close(fh)

##fileformat=VCFv4.1
##INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">
##INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">
##INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">
##INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">
##INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Seque

22	20000199	rs183293480	A	C	100	PASS	LDAF=0.0009;THETA=0.0004;AN=2184;AVGPOST=0.9990;VT=SNP;AA=A;RSQ=0.6274;SNPSOURCE=LOWCOV;AC=1;ERATE=0.0003;AF=0.0005;EUR_AF=0.0013	GT:DS:GL	0/0:0.000:-0.00,-2.04,-5.00	0/0:0.000:-0.07,-0.82,-3.47	0/0:0.000:-0.07,-0.83,-5.00	0/0:0.000:-0.03,-1.12,-5.00	0/0:0.000:-0.11,-0.64,-4.10	0/0:0.000:-0.12,-0.62,-3.85	0/0:0.000:-0.01,-1.47,-5.00	0/0:0.000:-0.01,-1.54,-5.00	0/0:0.000:-0.10,-0.70,-4.70	0/0:0.000:-0.03,-1.18,-5.00	0/0:0.000:-0.16,-0.50,-3.30	0/0:0.000:-0.48,-0.48,-0.48	0/0:0.000:-0.03,-1.20,-5.00	0/0:0.000:-0.10,-0.70,-5.00	0/0:0.000:-0.19,-0.46,-2.46	0/0:0.000:-0.16,-0.51,-2.67	0/0:0.000:-0.00,-2.57,-5.00	0/0:0.000:-0.00,-2.85,-5.00	0/0:0.000:-0.48,-0.48,-0.48	0/0:0.000:-0.00,-2.55,-5.00	0/0:0.000:-0.00,-2.02,-5.00	0/0:0.050:-0.48,-0.48,-0.48	0/0:0.000:-0.23,-0.46,-1.24	0/0:0.000:-0.02,-1.45,-5.00	0/0:0.000:-0.10,-0.68,-4.40	0/0:0.000:-0.06,-0.88,-5.00	0/0:0.000:-0.13,-0.61,-2.23	0/0:0.000:-0.05,-0.94,-5.00	0/0:0.000:-0.26,-0.43,-1.11	0/0:0.000:-0

22	20000428	rs55902548	G	T	100	PASS	AC=323;AVGPOST=0.9983;AA=G;AN=2184;VT=SNP;RSQ=0.9949;LDAF=0.1473;SNPSOURCE=LOWCOV;ERATE=0.0003;THETA=0.0003;AF=0.15;ASN_AF=0.0017;AMR_AF=0.15;AFR_AF=0.31;EUR_AF=0.15	GT:DS:GL	1/0:1.000:-5.00,0.00,-5.00	0/0:0.000:-0.35,-0.43,-0.73	0/1:1.000:-1.81,-0.01,-2.95	0/0:0.000:-0.01,-1.79,-5.00	0/0:0.000:-0.06,-0.86,-5.00	1/0:1.000:-0.19,-0.46,-2.18	0/0:0.000:-0.10,-0.68,-5.00	0/1:1.000:-4.40,-0.03,-1.12	0/1:1.000:-5.00,-0.69,-0.10	0/0:0.000:-0.10,-0.69,-4.70	0/0:0.000:-0.48,-0.48,-0.48	0/1:1.000:-5.00,-0.01,-1.77	0/0:0.000:-0.18,-0.48,-2.57	0/0:0.000:-0.02,-1.31,-5.00	0/0:0.000:-0.11,-0.65,-4.70	0/0:0.000:-0.10,-0.68,-4.70	0/0:0.000:-0.01,-1.72,-5.00	1/0:1.000:-5.00,0.00,-5.00	1/0:1.000:-1.38,-0.02,-2.61	0/1:1.000:-5.00,-1.40,-0.02	0/0:0.000:-0.00,-2.97,-5.00	0/0:0.000:-0.19,-0.47,-2.15	0/0:0.000:-0.44,-0.46,-0.54	0/0:0.000:-0.00,-2.52,-5.00	0/0:0.000:-0.05,-0.93,-5.00	0/0:0.000:-0.01,-1.77,-5.00	0/1:0.750:-0.22,-0.46,-1.26	0/0:0.000:-0.03,-1.19,-5.00	0/0:0.0

As in typical VCF files, it has a number of meta-information lines, one header line, and then one line for each marker. In this VCF, genetic data has fields GT (genotype), DS (dosage), and GL (genotype likelihood).
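To illustrate how one per-sample entry relates to the `GT:DS:GL` format string, the following sketch splits a single entry by hand. The helper `parse_sample_field` is hypothetical and only for illustration; in practice the packages used below do this parsing for you.

```julia
# Split one per-sample entry according to the FORMAT column (e.g. "GT:DS:GL").
# `parse_sample_field` is a hypothetical helper for illustration only.
function parse_sample_field(format::AbstractString, entry::AbstractString)
    tags = split(format, ':')
    vals = split(entry, ':')
    return Dict(String(k) => String(v) for (k, v) in zip(tags, vals))
end

# One entry copied from the second record printed above.
fields = parse_sample_field("GT:DS:GL", "0/1:1.000:-1.81,-0.01,-2.95")

# Count ALT alleles in the genotype call; '/' separates unphased and '|' phased alleles.
alt_count = count(==("1"), split(fields["GT"], r"[/|]"))
ds = parse(Float64, fields["DS"])
```

Here the hard call `0/1` corresponds to one ALT allele, which agrees with the stored dosage of `1.000`.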

To access number of records and samples:

In [2]:
records = nrecords("test_vcf.vcf.gz")

1356

In [3]:
samples = nsamples("test_vcf.vcf.gz")

191

Information on samples and variants can be retrieved using the `VariantCallFormat` package:

Sample names can be retrieved by: 

In [4]:
using VariantCallFormat
reader = VCF.Reader(openvcf("test_vcf.vcf.gz"))
h = header(reader)
h.sampleID

191-element Vector{String}:
 "HG00096"
 "HG00097"
 "HG00099"
 "HG00100"
 "HG00101"
 "HG00102"
 "HG00103"
 "HG00104"
 "HG00106"
 "HG00108"
 "HG00109"
 "HG00110"
 "HG00111"
 ⋮
 "HG00383"
 "HG00384"
 "HG00403"
 "HG00404"
 "HG00406"
 "HG00407"
 "HG00418"
 "HG00419"
 "HG00421"
 "HG00422"
 "HG00427"
 "HG00428"

Information of each variant is accessible by:

In [5]:
reader = VCF.Reader(openvcf("test_vcf.vcf.gz"))
println("chrom\tposition\tids\treference\talternative")
cnt = 0
for record in reader
    println("$(VCF.chrom(record))\t$(VCF.pos(record))\t$(try VCF.id(record) catch; ["."] end)\t$(VCF.ref(record))\t$(VCF.alt(record))")
    cnt += 1
    if cnt == 30
        break
    end
end

chrom	position	ids	reference	alternative
22	20000086	["rs138720731"]	T	["C"]
22	20000146	["rs73387790"]	G	["A"]
22	20000199	["rs183293480"]	A	["C"]
22	20000291	["rs185807825"]	G	["T"]
22	20000428	["rs55902548"]	G	["T"]
22	20000683	["rs142720028"]	A	["G"]
22	20000771	["rs114690707"]	A	["C"]
22	20000793	["rs189842693"]	T	["C"]
22	20000810	["rs147349046"]	C	["T"]
22	20000814	["rs183154520"]	T	["C"]
22	20000864	["rs187930998"]	G	["A"]
22	20000882	["rs148068532"]	C	["G"]
22	20000950	["rs1978233"]	T	["G"]
22	20000975	["rs141800233"]	G	["A"]
22	20001001	["rs192051979"]	T	["C"]
22	20001006	["rs2079702"]	G	["A"]
22	20001016	["rs183256914"]	C	["T"]
22	20001157	["rs150580380"]	G	["A"]
22	20001159	["rs139570132"]	C	["T"]
22	20001219	["rs143369598"]	G	["C"]
22	20001333	["rs5993894"]	C	["T"]
22	20001434	["rs146344141"]	C	["T"]
22	20001455	["rs188666449"]	G	["A"]
22	20001521	["rs139601437"]	C	["A"]
22	20001587	["rs71788814"]	CAG	["C"]
22	20001600	["rs144217522"]	T	["A"]
22	20001655	["rs192606530"]	G	

Each row of a VCF file represents a single variant, so it is natural to parse the genotypes or dosages variant-by-variant. The function `copy_gt!()` computes genotypes of each variant, and `copy_ds!()` computes dosages of each variant, represented by values in $[0, 2]$. 


The keyword arguments:
- `model` defines which genotype model to use. The common choice is `:additive` for 0/1/2 encoding.
- `impute` sets whether to impute the missing values with the mean of nonmissing values.
- `sampleID` stores the list of sample IDs.
- `record_chr`, `record_pos`, `record_ids`, `record_ref`, and `record_alt` store the chromosome, position, identifiers, reference allele, and alternative alleles, respectively.
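As a rough picture of what mean imputation means, the sketch below fills missing entries of a toy genotype vector with the mean of the observed entries. The function `impute_mean!` is a hypothetical name written for this illustration and is not the package's internal implementation.

```julia
using Statistics

# Replace missing entries by the mean of the observed entries.
# `impute_mean!` is a hypothetical helper; it only mimics the behavior described for `impute`.
function impute_mean!(g::AbstractVector{Union{Missing, Float64}})
    mu = mean(skipmissing(g))
    for i in eachindex(g)
        if ismissing(g[i])
            g[i] = mu
        end
    end
    return g
end

g_toy = Union{Missing, Float64}[0.0, 1.0, missing, 2.0, 1.0]  # toy data
impute_mean!(g_toy)
```

After the call, the third entry holds the mean of the four observed values, so the overall mean of the vector is unchanged by the imputation.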

In [31]:
using VariantCallFormat
using Statistics
using StatsBase
# initialize VCF reader
people, snps = nsamples("test_vcf.vcf.gz"), nrecords("test_vcf.vcf.gz")
reader = VCF.Reader(openvcf("test_vcf.vcf.gz"))

# pre-allocate vector for marker data
g = zeros(Union{Missing, Float64}, people)
sampleID = Vector{String}(undef, people)
rec_chr = Vector{String}(undef, 1)
rec_pos = Vector{Any}(undef, 1)
rec_ids = Vector{Vector{String}}(undef, 1)
rec_ref = Vector{String}(undef, 1)
rec_alt = Vector{Vector{String}}(undef, 1)

for j = 1:30
    copy_gt!(g, reader; model = :additive, impute = true, 
        sampleID=sampleID,
        record_chr=rec_chr,
        record_pos=rec_pos,
        record_ids=rec_ids,
        record_ref=rec_ref,
        record_alt=rec_alt)
    println(mean(g))
end
close(reader)

0.0
0.0
0.0
0.0
0.2931937172774869
0.0
0.0
0.015706806282722512
0.0
0.0
0.005235602094240838
0.0
1.931937172774869
0.0
0.0
1.801047120418848
0.005235602094240838
0.0
0.0
0.015706806282722512
0.0
0.03664921465968586
0.0
0.0
0.0
0.0
0.0
0.04712041884816754
0.0
0.0


### PLINK 1 BED format (`.bed` + `.bim` and `.fam`)

To process variant information rapidly while keeping the file size small, various binary file types have been devised. The BED format is basically a compact array of two-bit representations of genotypes.
This is the native format of the celebrated large-scale GWAS tool, [PLINK 1](https://www.cog-genomics.org/plink/).
It is suitable for high-performance applications that deal directly with genotype matrices, such as [iterative hard thresholding](https://github.com/OpenMendel/MendelIHT.jl) and [admixture analysis](https://github.com/OpenMendel/OpenADMIXTURE.jl). A major weakness of this format is that it cannot store imputed information when genotypes are uncertain.

| Genotype | Plink/SnpArray |
|:---:|:---:|
| A1,A1 | 0x00 |
| missing | 0x01 |
| A1,A2 | 0x02 |
| A2,A2 | 0x03 |
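The two-bit codes in the table can be unpacked by hand, as sketched below. This assumes the usual PLINK 1 layout in which each byte holds four genotypes, with the lowest-order two bits belonging to the first sample. The helper names `unpack_byte` and `a2_count` are hypothetical; `SnpArrays.jl` handles this internally.

```julia
# Unpack one byte of a PLINK 1 .bed file into four two-bit genotype codes.
# Assumption: the lowest-order two bits belong to the first sample in the byte.
unpack_byte(b::UInt8) = [(b >> (2 * k)) & 0x03 for k in 0:3]

# Map a two-bit code to the count of the second allele (A2), or `missing`.
function a2_count(code::UInt8)
    code == 0x00 && return 0.0      # A1,A1
    code == 0x01 && return missing  # missing genotype
    code == 0x02 && return 1.0      # A1,A2
    return 2.0                      # A2,A2 (0x03)
end

codes  = unpack_byte(0xe4)  # 0xe4 has bits 11 10 01 00, a toy byte covering all four codes
counts = a2_count.(codes)
```

Four genotypes per byte is where the size reduction relative to a one-character-per-genotype representation comes from.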



- __Pros__: 
    - Small size. 1/32, 1/16, 1/4 compared to double-precision/single-precision/single character representation
    - Fixed-width format suitable for massively accelerated computation (e.g. GPU acceleration)
- __Cons__: 
    - Only contains hard-called genotypes (0, 1, 2, or missing), and cannot contain imputed values
    - Need external files for variant (`.bim`) and sample (`.fam`) information
    - REF/ALT designation unsupported
    
#### Basic usage: `SnpArrays.jl`

In [7]:
using SnpArrays
d = SnpData("test_bed")

SnpData(people: 191, snps: 1356,
snp_info: 
 Row │ chromosome  snpid        genetic_distance  position  allele1  allele2
     │ String      String       Float64           Int64     String   String
─────┼───────────────────────────────────────────────────────────────────────
   1 │ 22          rs138720731               0.0  20000086  C        T
   2 │ 22          rs73387790                0.0  20000146  A        G
   3 │ 22          rs183293480               0.0  20000199  C        A
   4 │ 22          rs185807825               0.0  20000291  T        G
   5 │ 22          rs55902548                0.0  20000428  T        G
   6 │ 22          rs142720028               0.0  20000683  G        A
…,
person_info: 
 Row │ fid        iid        father     mother     sex        phenotype
     │ Abstract…  Abstract…  Abstract…  Abstract…  Abstract…  Abstract…
─────┼──────────────────────────────────────────────────────────────────
   1 │ 0          HG00096    0          0          0          -9


Information on the $i$-th sample is accessible by:

In [8]:
i = 20
d.person_info[i, :]

| Row | fid | iid | father | mother | sex | phenotype |
|---:|:---|:---|:---|:---|:---|:---|
| | Abstract… | Abstract… | Abstract… | Abstract… | Abstract… | Abstract… |
| 20 | 0 | HG00119 | 0 | 0 | 0 | -9 |


Information on the $j$-th variant is accessible by:

In [9]:
j = 7
d.snp_info[j, :]

| Row | chromosome | snpid | genetic_distance | position | allele1 | allele2 |
|---:|:---|:---|:---|:---|:---|:---|
| | String | String | Float64 | Int64 | String | String |
| 7 | 22 | rs114690707 | 0.0 | 20000771 | C | A |


Genotype of sample $i$, variant $j$ is accessible by:

In [10]:
d.snparray[i, j]

0x03

Note that this is the encoding defined by the table above. Converted to the numeric `0`/`1`/`2` encoding, it corresponds to `2`.

Genotypes of each variant in numeric form can be accessed by:

In [11]:
g = Vector{Float64}(undef, d.people)
cnt = 0
for j in 1:30
    copyto!(g, @view(d.snparray[:, j]); impute=true, model=ADDITIVE_MODEL, center=false, scale=false)
    println(mean(g))
end

2.0
2.0
2.0
2.0
1.706806282722513
2.0
2.0
1.9842931937172774
2.0
2.0
1.9947643979057592
2.0
0.06806282722513089
2.0
2.0
0.19895287958115182
1.9947643979057592
2.0
2.0
1.9842931937172774
2.0
1.963350785340314
2.0
2.0
2.0
2.0
2.0
1.9528795811518325
2.0
2.0


Note that the numeric values count the second allele (`allele2`), which is often the reference allele, as it is in this file. To run analyses based on the alternate allele count, the values have to be reversed (subtracted from `2.0`). This has led some projects, most notably UK Biobank, to put the reference allele first.
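As a small check of this reversal, the sketch below flips two of the mean values printed above. With 0/1/2 coding and no missing data (or mean imputation), the two allele counts sum to 2, so flipping the mean is a single subtraction.

```julia
# Mean counts of the second allele (here REF) for the 5th and 13th variants,
# copied from the output above.
ref_means = [1.706806282722513, 0.06806282722513089]

# The REF and ALT counts sum to 2, so the ALT means follow by subtraction.
alt_means = 2.0 .- ref_means
```

The flipped values agree, up to floating-point rounding, with the means of about 0.2932 and 1.9319 obtained for the same variants from the VCF file.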

### Oxford BGEN format (`.bgen` + optional `.sample`)
The BGEN format is native to Oxford statistical genetics tools, such as [IMPUTE2](https://mathgen.stats.ox.ac.uk/impute/impute_v2.html) and [SNPTEST](https://www.well.ox.ac.uk/~gav/snptest/). This format employs a variant-by-variant compression scheme, well-tailored for GWAS applications. The UK Biobank imputed data is distributed in this format.

- __Pros__: 
    - Very small file size. Further compression from the fixed-width genotype representation
    - Admits variable precision.
    - Variant-by-variant compression suitable for GWAS
    - Supports dosage and phase information
- __Cons__: 
    - Needs an external index for random access
    - Compression not tailored for genetic context
    - REF/ALT designation unsupported
    
#### Basic usage: `BGEN.jl`

In [12]:
using BGEN
b = Bgen("test_bgen.bgen")



Bgen(IOStream(<file test_bgen.bgen>), 0x0000000000020c71, BGEN.Header(0x000006d3, 0x00000014, 0x0000054c, 0x000000bf, 0x02, 0x02, false), ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"  …  "182", "183", "184", "185", "186", "187", "188", "189", "190", "191"], nothing)

Sample names are accessible by: 

In [13]:
BGEN.samples(b)

191-element Vector{String}:
 "1"
 "2"
 "3"
 "4"
 "5"
 "6"
 "7"
 "8"
 "9"
 "10"
 "11"
 "12"
 "13"
 ⋮
 "180"
 "181"
 "182"
 "183"
 "184"
 "185"
 "186"
 "187"
 "188"
 "189"
 "190"
 "191"

Because the way to iterate over the variants in a BGEN file depends on whether an index file is provided, we support an iterator interface for variants. If you are familiar with Python, this is similar to a generator defined with the `yield` statement. The iterator is accessible through the function `iterator()`.

In [32]:
println("rsid\tchrom\tpos\tn_alleles\tlist of alleles")
cnt = 0
for v in iterator(b)
    println("$(rsid(v::Variant))\t$(chrom(v))\t$(pos(v))\t$(n_alleles(v))\t$(alleles(v))")
    cnt += 1 
    if cnt == 30
        break
    end
end

rsid	chrom	pos	n_alleles	list of alleles
rs138720731	22	20000086	2	["C", "T"]
rs73387790	22	20000146	2	["A", "G"]
rs183293480	22	20000199	2	["C", "A"]
rs185807825	22	20000291	2	["T", "G"]
rs55902548	22	20000428	2	["T", "G"]
rs142720028	22	20000683	2	["G", "A"]
rs114690707	22	20000771	2	["C", "A"]
rs189842693	22	20000793	2	["C", "T"]
rs147349046	22	20000810	2	["T", "C"]
rs183154520	22	20000814	2	["C", "T"]
rs187930998	22	20000864	2	["A", "G"]
rs148068532	22	20000882	2	["G", "C"]
rs1978233	22	20000950	2	["G", "T"]
rs141800233	22	20000975	2	["A", "G"]
rs192051979	22	20001001	2	["C", "T"]
rs2079702	22	20001006	2	["A", "G"]
rs183256914	22	20001016	2	["T", "C"]
rs150580380	22	20001157	2	["A", "G"]
rs139570132	22	20001159	2	["T", "C"]
rs143369598	22	20001219	2	["C", "G"]
rs5993894	22	20001333	2	["T", "C"]
rs146344141	22	20001434	2	["T", "C"]
rs188666449	22	20001455	2	["A", "G"]
rs139601437	22	20001521	2	["A", "C"]
rs71788814	22	20001587	2	["C", "CAG"]
rs144217522	22	20001600	2	["A", "T"]
rs19

The dosage of each variant is accessible by:

In [33]:
cnt = 0
for v in BGEN.iterator(b)
    g = first_allele_dosage!(b, v) # first allele is the ALT allele in this file.
    println(mean(g))
    cnt += 1
    if cnt == 30
        break
    end
end

0.0
0.0
0.0
0.0
0.29319373
0.0
0.0
0.015706806
0.0
0.0
0.005235602
0.0
1.9319372
0.0
0.0
1.8010471
0.005235602
0.0
0.0
0.015706806
0.0
0.036649216
0.0
0.0
0.0
0.0
0.0
0.04712042
0.0
0.0


### PLINK 2 PGEN format (`.pgen` + `.pvar` and `.psam`)

PGEN is a backward-compatible extension of the BED format for [PLINK 2](https://www.cog-genomics.org/plink/2.0/) under development. It tries to overcome the limitation of the BED format, and can incorporate phase and dosage information. Cutting-edge GWAS tools now support this format. 

- __Pros__: 
    - Very small file size, similar to BGEN.
    - Variant-by-variant compression suitable for GWAS. 
    - Allows dosage and phase information
    - Difflist or patch list: genetics-inspired compression incorporating linkage disequilibrium
        - 4x faster than reading in `BGEN` with similar precision
- __Cons__: 
    - Still under development
    - Difficult to write a parser for, because the format admits many forms, including the whole BED format.
    - Requires external variant (`.pvar`) and sample (`.psam`) files. 
    
Note: [Linkage disequilibrium](https://www.sciencedirect.com/topics/neuroscience/linkage-disequilibrium)

> Linkage disequilibrium (LD) is the correlation between nearby variants such that the alleles at neighboring polymorphisms (observed on the same chromosome) are associated within a population more often than if they were unlinked.

#### Basic usage: `PGENFiles.jl`

In [16]:
using PGENFiles



In [17]:
p = Pgen("test_pgen.pgen");

In [18]:
print(PGENFiles.n_variants(p))

1356

In [19]:
print(PGENFiles.n_samples(p))

191

In [20]:
v_iter = PGENFiles.iterator(p);

In [21]:
v = first(v_iter)
g, data, offset = get_genotypes(p, v)

(UInt8[0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00  …  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00], UInt8[0x00], 0x0000000000000001)

`g` is the genotype vector, `data` is the variant record, and `offset` denotes the end of dosage record on the current variant record track.

In [22]:
g

191-element Vector{UInt8}:
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
    ⋮
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00

The encoding is as follows:

| genotype code |	genotype category |
|:---:|:---:|
| `0x00` | 	homozygous REF |
| `0x01` |	heterozygous REF-ALT |
| `0x02` |	homozygous ALT |
| `0x03` |	missing |
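When summarizing these codes numerically, the missing code `0x03` needs care, since it would otherwise be counted as 3. The sketch below uses toy codes (not read from a file) and a hypothetical helper `to_alt_count` to map codes to ALT allele counts with `missing`.

```julia
using Statistics

# Convert PGEN genotype codes to ALT allele counts, mapping 0x03 to `missing`.
# `to_alt_count` is a hypothetical helper, not part of PGENFiles.jl.
to_alt_count(code::UInt8) = code == 0x03 ? missing : Float64(code)

g_codes = UInt8[0x00, 0x01, 0x02, 0x03, 0x01]  # toy codes for five samples
alt_counts = to_alt_count.(g_codes)

mean_alt = mean(skipmissing(alt_counts))  # ignores the missing call
naive    = mean(g_codes)                  # treats 0x03 as 3, which inflates the mean
```

Taking the mean of the raw codes is only safe when a variant has no missing genotypes.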

To avoid array allocations for iterative genotype extraction, one may preallocate `g` and reuse it.

In [23]:
g = Vector{UInt8}(undef, PGENFiles.n_samples(p))
cnt = 0
for v in PGENFiles.iterator(p)
    get_genotypes!(g, p, v)
    println(mean(g))
    # do something with genotypes in `g`...
    cnt += 1
    if cnt == 30
        break
    end
end

0.0
0.0
0.0
0.0
0.2931937172774869
0.0
0.0
0.015706806282722512
0.0
0.0
0.005235602094240838
0.0
1.931937172774869
0.0
0.0
1.801047120418848
0.005235602094240838
0.0
0.0
0.015706806282722512
0.0
0.03664921465968586
0.0
0.0
0.0
0.0
0.0
0.04712041884816754
0.0
0.0


Similarly, ALT allele dosages are available through the functions `alt_allele_dosage()` and `alt_allele_dosage!()`. As genotype information is required to retrieve dosages, space for genotypes is also required for `alt_allele_dosage!()`. These functions return the dosages, the parsed genotypes, and the endpoint of the dosage information in the current variant record.

In [24]:
d = Vector{Float32}(undef, PGENFiles.n_samples(p))
g = Vector{UInt8}(undef, PGENFiles.n_samples(p))
g_ld = similar(g)
cnt = 0
for v in v_iter
    cnt += 1
    alt_allele_dosage!(d, g, p, v)
    println(mean(d))
    # do something with dosage values in `d`...
    if cnt == 30
        break
    end
end

0.0
0.0
0.0
0.0
0.29319373
0.0
0.0
0.015706806
0.0
0.0
0.005235602
0.0
1.9319372
0.0
0.0
1.8010471
0.005235602
0.0
0.0
0.015706806
0.0
0.036649216
0.0
0.0
0.0
0.0
0.0
0.04712042
0.0
0.0


Information on each sample and variant is available by reading the `.psam` and `.pvar` files as `DataFrame`s (not yet supported by `PGENFiles.jl`). The `.pvar` format admits the regular `.vcf` format.

In [25]:
using DataFrames, CSV

In [26]:
sample_info = CSV.read("test_pgen.psam", DataFrame)
first(sample_info, 5)

| Row | #IID | SEX |
|---:|:---|:---|
| | String7 | String3 |
| 1 | HG00096 | |
| 2 | HG00097 | |
| 3 | HG00099 | |
| 4 | HG00100 | |
| 5 | HG00101 | |


In a `.pvar` file, all the lines starting with `#` form the header, which ends with the line of column names. For `test_pgen.pvar`, that is the 26th line.

In [27]:
variant_info = CSV.read("test_pgen.pvar", DataFrame; delim="\t", header=26)
first(variant_info, 5)

| Row | #CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO |
|---:|:---|:---|:---|:---|:---|:---|:---|:---|
| | Int64 | Int64 | String15 | String15 | String31 | String7 | String7 | String |
| 1 | 22 | 20000086 | rs138720731 | T | C | 100 | PASS | AC=7;RSQ=0.8454;AVGPOST=0.9983;AA=T;AN=2184;LDAF=0.0040;THETA=0.0001;VT=SNP;SNPSOURCE=LOWCOV;ERATE=0.0003;AF=0.0032;AFR_AF=0.01 |
| 2 | 22 | 20000146 | rs73387790 | G | A | 100 | PASS | LDAF=0.0169;RSQ=0.9482;THETA=0.0004;AA=G;AN=2184;AVGPOST=0.9972;VT=SNP;SNPSOURCE=LOWCOV;AC=36;ERATE=0.0003;AF=0.02;AFR_AF=0.07;EUR_AF=0.0013 |
| 3 | 22 | 20000199 | rs183293480 | A | C | 100 | PASS | LDAF=0.0009;THETA=0.0004;AN=2184;AVGPOST=0.9990;VT=SNP;AA=A;RSQ=0.6274;SNPSOURCE=LOWCOV;AC=1;ERATE=0.0003;AF=0.0005;EUR_AF=0.0013 |
| 4 | 22 | 20000291 | rs185807825 | G | T | 100 | PASS | ERATE=0.0005;AVGPOST=0.9983;AA=G;AN=2184;LDAF=0.0015;VT=SNP;SNPSOURCE=LOWCOV;RSQ=0.5564;AC=2;THETA=0.0003;AF=0.0009;ASN_AF=0.0035 |
| 5 | 22 | 20000428 | rs55902548 | G | T | 100 | PASS | AC=323;AVGPOST=0.9983;AA=G;AN=2184;VT=SNP;RSQ=0.9949;LDAF=0.1473;SNPSOURCE=LOWCOV;ERATE=0.0003;THETA=0.0003;AF=0.15;ASN_AF=0.0017;AMR_AF=0.15;AFR_AF=0.31;EUR_AF=0.15 |
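Rather than hard-coding the header line number, one could locate the column-name line by scanning for the line that starts with `#CHROM`. The sketch below does this on a small in-memory stand-in for a `.pvar` file. The helper `header_line` and the toy text are assumptions made for illustration.

```julia
# Find the 1-based line number of the column-name line (the one starting with "#CHROM").
# `header_line` is a hypothetical helper; the toy text below stands in for a real .pvar file.
function header_line(io::IO)
    for (i, line) in enumerate(eachline(io))
        startswith(line, "#CHROM") && return i
    end
    error("no #CHROM header line found")
end

toy_pvar = """
##fileformat=VCFv4.1
##INFO=<ID=AF,Number=1,Type=Float,Description="toy entry">
#CHROM\tPOS\tID\tREF\tALT
22\t20000086\trs138720731\tT\tC
"""
n_header = header_line(IOBuffer(toy_pvar))
```

For an actual file, the same function could be applied as `open(header_line, "test_pgen.pvar")`, and the result passed to the `header` keyword of `CSV.read`.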


## Exercise

[Minor allele frequency](https://en.wikipedia.org/wiki/Minor_allele_frequency) (MAF) is widely used for determining if a variant is rare or frequent, and in GWAS, it is used for measuring information content of a variant. One example is the ratio of actual numerical variance and expected variance from the binomial model ($2 \hat{p}(1-\hat{p})$, where $\hat{p}$ is the MAF). How would you compute minor allele frequencies using the packages we used? How would you determine the minor/major allele? How would you compute the variance ratio measure?

## File format transformation

The file formats can be converted into one another using plink2 on the command line. For example, the files used in this workshop were converted from a VCF file using the following commands.

In [28]:
run(`plink2 --vcf test_vcf.vcf.gz --export bgen-1.3 --out test_bgen`) # ALT allele comes first

PLINK v2.00a2.3 64-bit (24 Jan 2020)           www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to test_bgen.log.
Options in effect:
  --export bgen-1.3
  --out test_bgen
  --vcf test_vcf.vcf.gz

Start time: Wed Jul 13 14:23:24 2022
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 8 compute threads.
--vcf: 1k variants scanned.--vcf: 1356 variants scanned.
--vcf: 0k variants converted.    --vcf: test_bgen-temporary.pgen + test_bgen-temporary.pvar +
test_bgen-temporary.psam written.
191 samples (0 females, 0 males, 191 ambiguous; 191 founders) loaded from
test_bgen-temporary.psam.
1356 variants loaded from test_bgen-temporary.pvar.
Note: No phenotype data present.
Writing test_bgen.bgen ... 0%done.
Writing test_bgen.sample ... done.
End time: Wed Jul 13 14:23:24 2022


Process(`plink2 --vcf test_vcf.vcf.gz --export bgen-1.3 --out test_bgen`, ProcessExited(0))

In [29]:
run(`plink2 --vcf test_vcf.vcf.gz --make-bed --out test_bed`) # ALT allele comes first

PLINK v2.00a2.3 64-bit (24 Jan 2020)           www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to test_bed.log.
Options in effect:
  --make-bed
  --out test_bed
  --vcf test_vcf.vcf.gz

Start time: Wed Jul 13 14:23:24 2022
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 8 compute threads.
--vcf: 1k variants scanned.--vcf: 1356 variants scanned.
--vcf: 0k variants converted.    --vcf: test_bed-temporary.pgen + test_bed-temporary.pvar +
test_bed-temporary.psam written.
191 samples (0 females, 0 males, 191 ambiguous; 191 founders) loaded from
test_bed-temporary.psam.
1356 variants loaded from test_bed-temporary.pvar.
Note: No phenotype data present.
Writing test_bed.fam ... done.
Writing test_bed.bim ... done.
Writing test_bed.bed ... 0%done.
End time: Wed Jul 13 14:23:25 2022


Process(`plink2 --vcf test_vcf.vcf.gz --make-bed --out test_bed`, ProcessExited(0))

In [30]:
run(`plink2 --vcf test_vcf.vcf.gz --make-pgen --out test_pgen`)

PLINK v2.00a2.3 64-bit (24 Jan 2020)           www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to test_pgen.log.
Options in effect:
  --make-pgen
  --out test_pgen
  --vcf test_vcf.vcf.gz

Start time: Wed Jul 13 14:23:25 2022
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 8 compute threads.
--vcf: 1k variants scanned.--vcf: 1356 variants scanned.
--vcf: 0k variants converted.    --vcf: test_pgen-temporary.pgen + test_pgen-temporary.pvar +
test_pgen-temporary.psam written.
191 samples (0 females, 0 males, 191 ambiguous; 191 founders) loaded from
test_pgen-temporary.psam.
1356 variants loaded from test_pgen-temporary.pvar.
Note: No phenotype data present.
Writing test_pgen.psam ... done.
Writing test_pgen.pvar ... 0%0%1%1%2%2%3%3%4%4%5%5%6%6%7%7%8%8%9%9%10%10%11%11%12%12%13%13%14%14%15%15%16%16%17%1

Process(`plink2 --vcf test_vcf.vcf.gz --make-pgen --out test_pgen`, ProcessExited(0))

## Concluding Remarks: Applications in OpenMendel

While we have focused on variant-by-variant access to genetic variant data, many of the packages we have seen also provide utility functions such as filtering out variants with low MAF or low genotyping success rate, filtering by chromosome, and merging. For the BED format (`SnpArrays.jl`), we support high-performance linear algebra on the genotype matrix, with multithreading and GPU computation.

### GWAS application

- `OrdinalGWAS.jl` : GWAS for ordered categorical trait, e.g. disease status (undiagnosed, pre-disease, mild, moderate, severe)
- `TrajGWAS.jl` : GWAS for continuous longitudinal phenotypes using a modified linear mixed effects model. Tests both mean effect and within-subject variability effect.


### High-performance applications using fixed-width BED format

- `MendelIHT.jl` : Rather than GWAS with variant-by-variant testing, uses a penalized regression model.
- `OpenADMIXTURE.jl` : Estimates ancestry of samples. A Julia reimplementation of the highly celebrated ADMIXTURE software, which is written in C++; 8x faster in a multi-threaded setting and 35x faster using a GPU.


### Others
- `VarianceComponentModels.jl` : Fitting and testing variance component models
- `MendelImpute.jl` : Fast phase inference and genotype imputation
- `TraitSimulation.jl` : Quickly simulate phenotypes under a variety of genetic architectures.
- `QuasiCopula.jl` : Analysis of correlated data with specified marginals with a flexible quasi-copula distribution