# SnpArrays.jl

Data from [*genome-wide association studies (GWAS)*](https://en.wikipedia.org/wiki/Genome-wide_association_study) are often saved as a [**PLINK binary biallelic genotype table**](https://www.cog-genomics.org/plink2/formats#bed) or `.bed` file. To be useful, such files should be accompanied by a `.fam` file, containing metadata on the rows of the table, and a `.bim` file,
containing metadata on the columns. The `.fam` and `.bim` files are in tab-separated format.

The table contains the observed allelic type at `n` [*single nucleotide polymorphism*](https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism) (SNP) positions for `m` individuals. A SNP corresponds to a nucleotide position on the genome where some degree of variation has been observed in a population, with each individual having one of two possible *alleles* at that position on each of a pair of chromosomes. The three possible genotypes, together with the code for a missing genotype, are stored as

| Genotype | Plink/SnpArray |  
|:---:|:---:|  
| A1,A1 | 0x00 |  
| missing | 0x01 |
| A1,A2 | 0x02 |  
| A2,A2 | 0x03 |  
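
Each code in the table fits in two bits, so a `.bed` file can store four genotypes per byte. The following is a minimal sketch of that packing idea in plain Julia, assuming the low-order-bits-first ordering of the PLINK `.bed` format. The helper names `pack_genotypes` and `unpack_genotype` are hypothetical and are not part of SnpArrays.jl.

```julia
# Pack 2-bit genotype codes (0x00 to 0x03) four to a byte, low-order bits first.
function pack_genotypes(codes::AbstractVector{UInt8})
    packed = zeros(UInt8, cld(length(codes), 4))
    for (i, c) in enumerate(codes)
        byte, slot = divrem(i - 1, 4)          # which byte, and which 2-bit slot in it
        packed[byte + 1] |= (c & 0x03) << (2 * slot)
    end
    return packed
end

# Recover the i-th genotype code from the packed bytes.
function unpack_genotype(packed::AbstractVector{UInt8}, i::Integer)
    byte, slot = divrem(i - 1, 4)
    return (packed[byte + 1] >> (2 * slot)) & 0x03
end

codes = UInt8[0x00, 0x01, 0x02, 0x03, 0x03]    # A1A1, missing, A1A2, A2A2, A2A2
packed = pack_genotypes(codes)                  # 5 genotypes fit in 2 bytes
unpacked = [unpack_genotype(packed, i) for i in eachindex(codes)]
```

Five genotypes occupy two bytes here, and unpacking returns the original codes.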

## Installation

This package requires Julia v1.4 or later, which can be obtained from
<https://julialang.org/downloads/> or by building Julia from the sources in the
<https://github.com/JuliaLang/julia> repository.

The package can be installed by running the following code:
```julia
using Pkg
pkg"add SnpArrays"
```
For running the examples below, the following are also necessary. 
```julia
pkg"add BenchmarkTools DelimitedFiles Glob"
pkg"add https://github.com/OpenMendel/ADMIXTURE.jl"
```

For optional use on a CUDA-enabled GPU, the following is also needed. 
```julia
pkg"add Adapt CUDA"
```

In [None]:
versioninfo()

In [None]:
# for use in this tutorial
using SnpArrays, ADMIXTURE, BenchmarkTools, DelimitedFiles, Glob, Random
Sys.islinux() && (using CUDA);

## Example data

There are two example data sets attached to this package. They are available in the `data` folder of the package.

In [None]:
datapath = normpath(SnpArrays.datadir())

In [None]:
readdir(glob"mouse.*", datapath)

Data set `EUR_subset` contains no missing genotypes. It is located at

In [None]:
readdir(glob"EUR_subset.*", datapath)

Data from recent studies, which have samples from tens of thousands of individuals at over a million SNP positions, would be in the range of tens or even hundreds of GB.

## SnpArray

`SnpArray` is the fundamental type for dealing with genotype data in a Plink bed file. Each row of a `SnpArray` is a sample and each column a SNP.

### Constructor

There are various ways to initialize a SnpArray.

#### Initialize from Plink file set

A SnpArray can be initialized from the Plink bed file. The corresponding `.fam` file needs to be present; it is used to determine the number of individuals.

In [None]:
const mouse = SnpArray(SnpArrays.datadir("mouse.bed"))

The virtual size of the GWAS data is 1940 observations at each of 10150 SNP positions.

In [None]:
size(mouse)
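
The on-disk size of a bed file follows directly from these dimensions. The sketch below assumes the standard SNP-major `.bed` layout (3 header bytes, then `ceil(m / 4)` bytes per SNP). The function `bed_bytes` is a hypothetical helper, and the 50,000 by 1,000,000 study is an illustrative size rather than a real data set.

```julia
# Bytes in a SNP-major .bed file: 3 header bytes plus ceil(m / 4) bytes per SNP.
bed_bytes(m::Integer, n::Integer) = 3 + cld(m, 4) * n

mouse_bytes = bed_bytes(1940, 10150)                 # dimensions of the mouse example
large_gib = bed_bytes(50_000, 1_000_000) / 1024^3    # hypothetical large study, in GiB
```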

Because the file is memory-mapped, opening the file and accessing the data is fast, even for very large `.bed` files.

In [None]:
@btime(SnpArray(SnpArrays.datadir("mouse.bed")));

By default, the memory-mapped file is read-only, so changing entries is not allowed.

In [None]:
mouse[1, 1] = 0x00

To be able to change genotypes in a bed file, open it with write permission:
```julia
mouse = SnpArray(SnpArrays.datadir("mouse.bed"), "r+")
```

#### Initialize from only bed file

If only the bed file is present, the user must supply the number of individuals as the second argument.

In [None]:
SnpArray(SnpArrays.datadir("mouse.bed"), 1940)

#### Initialize from compressed Plink files

SnpArray can be initialized from Plink files in compressed formats: `gz`, `zlib`, `zz`, `xz`, `zst`, or `bz2`. For a complete list type
```julia
SnpArrays.ALLOWED_FORMAT
```
If you want to support a new compressed format, file an issue.

Let us first compress the mouse data in gz format. The gz files take less than 1/3 of the storage of the original Plink files.

In [None]:
compress_plink(SnpArrays.datadir("mouse"), "gz")
readdir(glob"mouse.*.gz", datapath)

To initialize a SnpArray from a gzipped Plink file, simply use the bed file with a name ending in `.bed.gz`:

In [None]:
# requires corresponding `.fam.gz` file
SnpArray(SnpArrays.datadir("mouse.bed.gz"))

or

In [None]:
# does not require corresponding `.fam.gz` file
SnpArray(SnpArrays.datadir("mouse.bed.gz"), 1940)

In [None]:
# clean up
rm(SnpArrays.datadir("mouse.bed.gz"), force=true)
rm(SnpArrays.datadir("mouse.fam.gz"), force=true)
rm(SnpArrays.datadir("mouse.bim.gz"), force=true)

#### Initialize and create bed file

Initialize 5 rows and 3 columns with all (A1, A1) genotype (0x00) and memory-map to a bed file `tmp.bed`

In [None]:
tmpbf = SnpArray("tmp.bed", 5, 3)

Change entries

In [None]:
tmpbf[1:2, 1:2] .= 0x03
tmpbf

In [None]:
fill!(tmpbf, 0x02)
tmpbf

In [None]:
# clean up
rm("tmp.bed", force=true)

Initialize 5 rows and 3 columns with undefined genotypes without memory-mapping to any file

In [None]:
tmpbf = SnpArray(undef, 5, 3)

Create a bed file corresponding to an existing SnpArray and memory-map it.

In [None]:
tmpbf = SnpArray("tmp.bed", tmpbf)

In [None]:
tmpbf[1, 1] = 0x02
tmpbf

In [None]:
# clean up
rm("tmp.bed", force=true)

### `convert` and `copyto!`

The most common use of a SnpArray is to convert genotypes to numeric values for statistical analysis. The conversion rule depends on the genetic model (additive, dominant, or recessive) and on whether centering, scaling, or imputation is requested.

#### `convert`

The `convert` function has four keyword arguments: `model`, `center`, `scale`, and `impute`.

The `model` keyword specifies the SNP model for conversion. By default, `convert` translates genotypes according to the *additive* SNP model, which essentially counts the number of **A2** alleles (0, 1, or 2) per genotype. The other SNP models are *dominant* and *recessive*, both in terms of the **A2** allele.

| Genotype | `SnpArray` | `model=ADDITIVE_MODEL` | `model=DOMINANT_MODEL` | `model=RECESSIVE_MODEL` |    
|:---:|:---:|:---:|:---:|:---:|  
| A1,A1 | 0x00 | 0 | 0 | 0 |  
| missing | 0x01 | NaN | NaN | NaN |
| A1,A2 | 0x02 | 1 | 1 | 0 |  
| A2,A2 | 0x03 | 2 | 1 | 1 |  
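
The table can be read as a lookup from genotype code to numeric value. The sketch below reproduces it in plain Julia; `genotype_value` and the symbols `:additive`, `:dominant`, and `:recessive` are hypothetical stand-ins for illustration, not the package's `ADDITIVE_MODEL`, `DOMINANT_MODEL`, and `RECESSIVE_MODEL` constants.

```julia
# Hypothetical lookup mirroring the table: genotype code -> numeric value per model.
# Assumes the code is one of 0x00, 0x01, 0x02, 0x03.
function genotype_value(code::UInt8, model::Symbol)
    code == 0x01 && return NaN                      # missing genotype
    a2 = code == 0x00 ? 0 : code == 0x02 ? 1 : 2    # number of A2 alleles
    model === :additive  && return Float64(a2)
    model === :dominant  && return Float64(a2 >= 1) # at least one A2 allele
    model === :recessive && return Float64(a2 == 2) # two A2 alleles
    throw(ArgumentError("unknown model $model"))
end

models = [:additive, :dominant, :recessive]
# Rows follow the codes 0x00 to 0x03, columns follow `models`.
value_table = [genotype_value(c, m) for c in 0x00:0x03, m in models]
```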

`center=true` tells `convert` to center each column by its mean. Default is `false`.

`scale=true` tells `convert` to scale each column by its standard deviation. Default is `false`.

`impute=true` tells `convert` to impute missing genotypes (0x01) by column mean. Default is `false`.
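
To make the interplay of the three switches concrete, here is a minimal sketch for a single SNP under the additive model. It is an illustration only: `standardize_column` is a hypothetical helper, and it uses the sample mean and sample standard deviation of the observed entries, which is an assumption and may differ from how SnpArrays.jl computes the scale.

```julia
# Hypothetical helper: additive dosages for one SNP with optional impute/center/scale.
# Needs at least two observed genotypes for the standard deviation to be defined.
function standardize_column(codes::AbstractVector{UInt8};
                            center::Bool=false, scale::Bool=false, impute::Bool=false)
    x = [c == 0x01 ? NaN : c == 0x00 ? 0.0 : c == 0x02 ? 1.0 : 2.0 for c in codes]
    obs = filter(!isnan, x)                          # observed dosages only
    μ = sum(obs) / length(obs)
    σ = sqrt(sum(abs2, obs .- μ) / (length(obs) - 1))
    impute && (x[isnan.(x)] .= μ)                    # fill missing entries with the mean
    center && (x .-= μ)
    scale && (x ./= σ)
    return x
end

snp = UInt8[0x00, 0x02, 0x03, 0x01]                  # last genotype is missing
z = standardize_column(snp; center=true, scale=true, impute=true)
z_unimputed = standardize_column(snp; center=true)
```

With imputation, the missing entry becomes the column mean and therefore maps to zero after centering; without it, the entry stays `NaN`.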

Convert whole SnpArray to a Float64 matrix using defaults (`model=ADDITIVE_MODEL`, `center=false`, `scale=false`, `impute=false`)

In [None]:
convert(Matrix{Float64}, mouse)

!!! note  

    When applying `convert` or `copyto!` to a slice or subarray of a SnpArray, using `view`, `@view`, or `@views` is necessary for both correctness and efficiency. Without a view, the call simply converts the UInt8 codes stored in the original bed file.
    

Convert a column to Float64 vector using defaults (`model=ADDITIVE_MODEL`, `center=false`, `scale=false`, `impute=false`).

In [None]:
# convert(Vector{Float64}, view(mouse, :, 1)) # alternative syntax
# @views convert(Vector{Float64}, mouse[:, 1]) # alternative syntax
convert(Vector{Float64}, @view(mouse[:, 1]))

Convert a subarray of SnpArray to Float64 matrix using defaults (`model=ADDITIVE_MODEL`, `center=false`, `scale=false`, `impute=false`).

In [None]:
convert(Matrix{Float64}, @view(mouse[1:2:10, 1:2:10]))

Different SNP models (`ADDITIVE_MODEL` vs `DOMINANT_MODEL` vs `RECESSIVE_MODEL`)

In [None]:
@views [convert(Vector{Float64}, mouse[:, 1], model=ADDITIVE_MODEL) convert(Vector{Float64}, mouse[:, 1], model=DOMINANT_MODEL) convert(Vector{Float64}, mouse[:, 1], model=RECESSIVE_MODEL)]

Center and scale (last column) while `convert`

In [None]:
convert(Vector{Float64}, @view(mouse[:, end]), center=true, scale=true)

Center, scale, and impute (last column) while `convert`

In [None]:
convert(Vector{Float64}, @view(mouse[:, end]), center=true, scale=true, impute=true)

#### `copyto!`

`copyto!` is the in-place version of `convert`. It takes the same keyword arguments (`model`, `center`, `scale`, `impute`) as `convert`.

Copy a column to a Float64 vector using defaults (`model=ADDITIVE_MODEL`, `center=false`, `scale=false`, `impute=false`).

In [None]:
v = zeros(size(mouse, 1))
copyto!(v, @view(mouse[:, 1]))

In [None]:
@btime(copyto!($v, $@view(mouse[:, 1])));

Copy columns using defaults

In [None]:
v2 = zeros(size(mouse, 1), 2)
copyto!(v2, @view(mouse[:, 1:2]))

In [None]:
# roughly double the cost of copying 1 column
@btime(copyto!($v2, $@view(mouse[:, 1:2])));

Center and scale

In [None]:
copyto!(v, @view(mouse[:, 1]), center=true, scale=true)

In [None]:
# costs more because of the extra pass for center, scale, and/or impute
@btime(copyto!($v, $(@view(mouse[:, 1])), center=true, scale=true));

Looping over all columns

In [None]:
v = Vector{Float64}(undef, size(mouse, 1))
function loop_test(v, s)
    for j in 1:size(s, 2)
        copyto!(v, @view(s[:, j]))
    end
end
@btime(loop_test($v, $mouse))

Copy whole SnpArray

In [None]:
M = similar(mouse, Float64)
@btime(copyto!($M, $mouse));

#### Impute missing genotypes using ADMIXTURE estimates

`convert` and `copyto!` can perform more fine-tuned imputation using the ancestry estimates from the [ADMIXTURE](https://github.com/OpenMendel/ADMIXTURE.jl) software.

**Step 1**: Calculate the ancestry estimates and allele frequencies using ADMIXTURE.jl. Here we assume $K=3$ populations.

In [None]:
# install ADMIXTURE package first 
using ADMIXTURE
if isfile("mouse.3.P") && isfile("mouse.3.Q")
    P = readdlm("mouse.3.P", ' ', Float64) 
    Q = readdlm("mouse.3.Q", ' ', Float64)
else
    # run ADMIXTURE using 4 threads
    P, Q = admixture(SnpArrays.datadir("mouse.bed"), 3, j=4)
end;

**Step 2**: Impute using the ancestry estimates `P` and `Q`. Note that `copyto!` and `convert` assume `P` has dimension `K x S` and `Q` has dimension `K x N`, where `K` is the number of populations, `S` is the number of SNPs, and `N` is the number of individuals. So we need to transpose the output of `admixture`.

In [None]:
Pt = P |> transpose |> Matrix
Qt = Q |> transpose |> Matrix
convert(Matrix{Float64}, mouse, Pt, Qt)
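
The idea behind ancestry-based imputation can be sketched with toy matrices in the same `K x S` and `K x N` layout. This is a sketch under stated assumptions, not the package's implementation: the numbers are made up, and it assumes the allele frequencies refer to the allele being counted. If they refer to the other allele, replace `f` by `1 .- f`.

```julia
using LinearAlgebra

# Toy ancestry estimates, already transposed: Pt_toy is K x S, Qt_toy is K x N.
Pt_toy = [0.1 0.5;
          0.9 0.5]      # allele frequency of each SNP in each of K = 2 populations
Qt_toy = [1.0 0.5;
          0.0 0.5]      # ancestry fractions of each of N = 2 individuals

# Individual-specific allele frequency: f[i, j] = sum over k of Qt_toy[k, i] * Pt_toy[k, j].
f = transpose(Qt_toy) * Pt_toy      # N x S
# Expected additive dosage, usable as a fill-in value for a missing genotype.
expected_dosage = 2 .* f
```

The first individual belongs entirely to population 1, so the expected dosage at the first SNP is `2 * 0.1`; the second individual is an even mixture, giving `2 * 0.5`.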

In [None]:
# takes slightly longer because of calculation involving P and Q
M = similar(mouse, Float64)
@btime(copyto!($M, $mouse, $Pt, $Qt));

### Summaries

#### Counts

Counts of each of the four possible values for each column are returned by `counts`.

In [None]:
counts(mouse, dims=1)

Column 2 has no missing values (code `0x01`, the second row in the column-counts table).
At that SNP position in this sample, 359 individuals are homozygous for allele 1 (`G` according to the `.bim` file), 1004 are heterozygous, and 577 are homozygous for allele 2 (`A`).

The counts by column and by row are cached in the `SnpArray` object. Accesses after the first are extremely fast.

In [None]:
@btime(counts($mouse, dims=1));
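
The layout of the column-counts table (one row per code, one column per SNP) can be illustrated with a small dense matrix of codes. The matrix and the function `column_counts` below are hypothetical and serve only to show the layout; they do not use a `SnpArray` or its cache.

```julia
# Hypothetical dense stand-in for a SnpArray: one genotype code per entry.
G = UInt8[0x00 0x03 0x01;
          0x02 0x03 0x00;
          0x03 0x02 0x00;
          0x02 0x03 0x01]

# Row k + 1 of the result counts how often code k appears in each column.
function column_counts(G::AbstractMatrix{UInt8})
    cc = zeros(Int, 4, size(G, 2))
    for j in axes(G, 2), i in axes(G, 1)
        cc[G[i, j] + 1, j] += 1
    end
    return cc
end

toy_counts = column_counts(G)
n_missing = toy_counts[2, :]     # second row holds the counts of 0x01 (missing)
```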

#### Minor allele frequencies

Minor allele frequencies (MAF) for each SNP.

In [None]:
maf(mouse)

Minor allele (`false` means A1 is the minor allele; `true` means A2 is the minor allele) for each SNP.

In [None]:
minorallele(mouse)
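
Both quantities can be derived from the column counts. As a worked example, the sketch below uses the counts quoted earlier for column 2 of the mouse data; the variable names are hypothetical, and the handling of a tie at exactly 0.5 is an assumption.

```julia
# Counts quoted earlier for column 2 of the mouse data: A1A1, missing, A1A2, A2A2.
n11, nmiss, n12, n22 = 359, 0, 1004, 577

# Frequency of the A2 allele among the 2 * (non-missing individuals) observed alleles.
freq_a2 = (n12 + 2 * n22) / (2 * (n11 + n12 + n22))
maf_col2 = min(freq_a2, 1 - freq_a2)
a2_is_minor = freq_a2 < 0.5      # tie-breaking at exactly 0.5 is an assumption here
```

Here the A2 frequency is above one half, so A1 is the minor allele for this SNP.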

#### `mean` and `var`

The package provides methods for the generics `mean` and `var` from the `Statistics` package.

In [None]:
mean(mouse, dims=1)

In [None]:
mean(mouse, dims=1, model=DOMINANT_MODEL)

In [None]:
var(mouse, dims=1)

These methods make use of the cached column or row counts and thus are very fast.

In [None]:
@btime(mean($mouse, dims=1));

The column-wise or row-wise standard deviations are returned by `std`.

In [None]:
std(mouse, dims=2)

#### Missing rate

Proportion of missing genotypes

In [None]:
missingrate(mouse, 1)

In [None]:
missingrate(mouse, 2)

#### Location of the missing values

The positions of the missing data are evaluated by

In [None]:
mp = missingpos(mouse)

In [None]:
@btime(missingpos($mouse));

So, for example, the number of missing data values in each column can be evaluated as

In [None]:
sum(mp, dims=1)

although it is faster, but somewhat more obscure, to use

In [None]:
view(counts(mouse, dims=1), 2:2, :)

### Simulation

#### Independent SNPs
Given a fixed vector of minor allele frequencies, we can simulate SNP data from a homogeneous population with high efficiency.

As an example, suppose we are interested in simulating $100,000$ SNPs from $500,000$ subjects, without missingness.

In [None]:
M, N = 500_000, 100_000
# Size of packed SNP data in memory (2 bits per genotype), in GiB
packed_size = round((M * N * sizeof(UInt8) / 4) / (1024^3); digits = 2)
# Size of SNP data stored as Int (8 bytes per genotype on 64-bit systems), in GiB
unpacked_size = round(M * N * sizeof(Int) / 1024^3; digits = 2);

Simulating the data using 64-bit integers would require about 373 GiB of memory, well over what is available in the majority of personal computers. However, generating simulated values using the packed representation needs only about 11.6 GiB of memory, which is feasible on many machines.

We will randomly sample minor allele frequencies in the interval $(0.01, 0.50)$.

In [None]:
rng = Random.MersenneTwister(2024)
MAFs = 0.01 .+ 0.49 .* rand(rng, N)
@timev sim = SnpArrays.simulate(rng, M, N, MAFs)

Even including compilation time, it takes only a little over 60 seconds on a consumer computer to simulate 100,000 SNPs for 500,000 subjects.
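
For intuition, the sampling model can be sketched on a small scale in plain Julia. This is a sketch under stated assumptions, not the algorithm used by `SnpArrays.simulate`: it assumes Hardy-Weinberg equilibrium, codes the minor allele as A2, and stores one byte per genotype instead of the packed representation. The names `simulate_codes` and `toy_rng` are hypothetical.

```julia
using Random

# Hypothetical sketch: draw genotype codes for m individuals at SNPs with given MAFs,
# assuming Hardy-Weinberg equilibrium and that the minor allele is coded as A2.
function simulate_codes(rng::AbstractRNG, m::Integer, mafs::AbstractVector{<:Real})
    code_for = (0x00, 0x02, 0x03)              # 0, 1, or 2 copies of the minor allele
    G = Matrix{UInt8}(undef, m, length(mafs))
    for (j, p) in enumerate(mafs), i in 1:m
        copies = (rand(rng) < p) + (rand(rng) < p)   # two independent allele draws
        G[i, j] = code_for[copies + 1]
    end
    return G
end

toy_rng = MersenneTwister(2024)
G = simulate_codes(toy_rng, 20_000, [0.05, 0.30])
# Estimated frequency of the minor allele at each SNP.
est = [sum(c -> c == 0x02 ? 1 : c == 0x03 ? 2 : 0, view(G, :, j)) / (2 * size(G, 1))
       for j in axes(G, 2)]
```

With 20,000 individuals, the estimated frequencies should land close to the 0.05 and 0.30 used to generate the data.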

In [None]:
cc = counts(sim; dims = 1)
sim_means = dropdims(mean(sim; dims = 1); dims=1)

In [None]:
sqrt(mean(abs2, (sim_means / 2 - MAFs) ./ MAFs))

The relative RMSE in minor allele frequency estimation is only 0.003, verifying that the parameters we defined can indeed be recovered from the simulated data.

#### Missing Values

We can also simulate missing values easily, using the `simulate_missing!` function to insert missing values into our data.

First, we need a vector of length `N` defining the missingness proportion or rate at each simulated locus.

In [None]:
missing_rate = rand(rng, N);

Then, we can create missing values by:

In [None]:
@timev sim = SnpArrays.simulate_missing!(rng, sim, missing_rate)

Simulating missing values takes about as long as simulating the initial values themselves.
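
The effect of per-locus missingness rates can be illustrated on a small dense matrix of codes. The function `add_missing!` below is a hypothetical stand-in for the idea, assuming each entry is masked independently with the rate of its column; it is not the implementation of `simulate_missing!`.

```julia
using Random

# Hypothetical sketch: overwrite entries of a code matrix with 0x01 (missing),
# independently with probability rates[j] in column j.
function add_missing!(rng::AbstractRNG, G::AbstractMatrix{UInt8},
                      rates::AbstractVector{<:Real})
    length(rates) == size(G, 2) || throw(DimensionMismatch("need one rate per column"))
    for j in axes(G, 2), i in axes(G, 1)
        rand(rng) < rates[j] && (G[i, j] = 0x01)
    end
    return G
end

G = fill(0x03, 1000, 3)                       # start with no missing genotypes
add_missing!(MersenneTwister(1), G, [0.0, 0.5, 1.0])
observed_rate = vec(sum(G .== 0x01, dims=1)) ./ size(G, 1)
```

A rate of 0 leaves a column untouched, a rate of 1 masks it entirely, and intermediate rates are recovered approximately.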

#### Linkage Disequilibrium

We can also simulate SNPs under a hypothetical LD structure, assuming that the correlation between adjacent SNPs can be described by an AR-1 process with constant $\rho$.

Simulating SNPs under this LD structure takes about four times as long as independent sampling without LD, but this is still incredibly fast compared to other methods.

In [None]:
# Assume a correlation of 0.10 between adjacent SNPs
ρ = 0.10
@timev sim = SnpArrays.simulate(rng, M, N, MAFs, ρ)
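
Under an AR(1) structure with constant coefficient, the correlation between two positions decays geometrically with the distance between them. The sketch below builds that correlation matrix for a handful of SNPs; `ar1_correlation` is a hypothetical helper, and whether the package imposes this correlation on the genotypes themselves or on a latent variable is not stated here.

```julia
# Correlation matrix implied by an AR(1) process with coefficient ρ:
# entry (j, k) equals ρ raised to the distance |j - k| between SNP indices.
ar1_correlation(n::Integer, ρ::Real) = [ρ^abs(j - k) for j in 1:n, k in 1:n]

Σ = ar1_correlation(4, 0.10)
```

With a coefficient of 0.10, adjacent SNPs have correlation 0.1, SNPs two apart 0.01, and so on.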

In [None]:
sim = nothing

### Genetic relationship matrix (GRM)

#### Homogeneous population

For a homogeneous population, the `grm` function computes the empirical kinship matrix using either the classical genetic relationship matrix, `grm(A, method=:GRM)`, the method of moments, `grm(A, method=:MoM)`, or the robust method, `grm(A, method=:Robust)`. See the section _Kinship Comparison_ of the [manuscript](http://hua-zhou.github.io/media/pdf/Zhou19OpenMendel.pdf) for the formulae and references for these methods.

Classical genetic relationship matrix

In [None]:
# grm(mouse, method=:MoM)
# grm(mouse, method=:Robust)
g = grm(mouse, method=:GRM)

In [None]:
@btime(grm($mouse, method=:GRM));
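
For reference, the classical estimator can be sketched on a small dense dosage matrix. This is a minimal sketch, assuming the common formula in which each SNP is centered by twice its allele frequency, scaled by its binomial standard deviation, and the cross-product is divided by `2S` (kinship scale). The function `classical_grm` is hypothetical; the manuscript remains the authority on the exact formula used by `grm`.

```julia
using LinearAlgebra

# Hypothetical sketch of the classical GRM on the kinship scale from a dense
# dosage matrix X (individuals x SNPs, entries 0/1/2, no missing values).
# Assumes every SNP is polymorphic, so that 0 < p < 1.
function classical_grm(X::AbstractMatrix{<:Real})
    m, S = size(X)
    p = vec(sum(X, dims=1)) ./ (2m)                     # A2 allele frequency per SNP
    Z = (X .- 2 .* p') ./ sqrt.(2 .* p' .* (1 .- p'))   # center and scale each column
    return (Z * Z') ./ (2S)
end

X = Float64[0 1 2; 2 1 0; 1 2 1; 1 0 2]
Φ = classical_grm(X)
```

The result is a symmetric matrix with one row and one column per individual.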

Using Float32 (single precision) potentially saves memory usage and computation time.

In [None]:
grm(mouse, method=:GRM, t=Float32)

In [None]:
@btime(grm($mouse, method=:GRM, t=Float32));

By default, `grm` excludes SNPs with minor allele frequency below 0.01. This can be changed by the keyword argument `minmaf`.

In [None]:
# compute GRM excluding SNPs with MAF≤0.05 
grm(mouse, minmaf=0.05)

To specify specific SNPs for calculating empirical kinship, use the `cinds` keyword (default is `nothing`). When `cinds` is specified, `minmaf` is ignored.

In [None]:
# GRM using every other SNP
grm(mouse, cinds=1:2:size(mouse, 2))

#### Inhomogeneous/admixed populations

For an inhomogeneous/admixed population, we recommend first estimating the ancestry and population allele frequencies using the ADMIXTURE software. See [ADMIXTURE.jl](https://github.com/OpenMendel/ADMIXTURE.jl) for usage. Then compute the kinship coefficients using the `P` (allele frequencies) and `Q` (ancestry fractions) matrices from the output of ADMIXTURE. This is essentially what the [REAP software](http://faculty.washington.edu/tathornt/software/REAP) does, except our implementation runs much faster than REAP (>50 fold speedup).

In [None]:
# first read in the P and Q matrix output from ADMIXTURE and transpose them
Pt = readdlm("mouse.3.P", ' ', Float64) |> transpose |> Matrix
Qt = readdlm("mouse.3.Q", ' ', Float64) |> transpose |> Matrix;

In [None]:
SnpArrays.grm_admixture(mouse, Pt, Qt)

In [None]:
# clean up
rm("mouse.3.P", force = true)
rm("mouse.3.Q", force = true)

### Filtering

Before GWAS, we often need to filter SNPs and/or samples according to genotyping success rates, minor allele frequencies, and Hardy-Weinberg Equilibrium test. This can be achieved by the `filter` function.

```@docs
SnpArrays.filter
```

By default, it outputs row and column masks such that the sample-wise and SNP-wise genotyping success rates are at least 0.98 and the minor allele frequencies are at least 0.01. Users can opt to filter according to the Hardy-Weinberg test by setting the minimum p-value `min_hwe_pval`.
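
The criteria can be made concrete with a simplified sketch on a dense matrix of codes. It is only an illustration: `simple_filter` is a hypothetical function that applies the thresholds in a single pass and omits the Hardy-Weinberg test, whereas the package's `filter` may proceed differently (for example, by re-checking rates after rows or columns are dropped).

```julia
# Hypothetical single-pass filter on a dense code matrix (0x01 = missing).
function simple_filter(G::AbstractMatrix{UInt8}; min_success_rate=0.98, min_maf=0.01)
    observed = G .!= 0x01
    rowmask = vec(sum(observed, dims=2)) ./ size(G, 2) .>= min_success_rate
    colmask = vec(sum(observed, dims=1)) ./ size(G, 1) .>= min_success_rate
    for j in axes(G, 2)
        col = view(G, :, j)
        nobs = count(!=(0x01), col)                          # non-missing genotypes
        a2 = count(==(0x02), col) + 2 * count(==(0x03), col) # number of A2 alleles
        freq = nobs == 0 ? 0.0 : a2 / (2 * nobs)
        colmask[j] &= min(freq, 1 - freq) >= min_maf
    end
    return rowmask, colmask
end

toy = UInt8[0x00 0x02 0x01;
            0x03 0x00 0x01;
            0x02 0x00 0x02]
toy_rowmask, toy_colmask = simple_filter(toy; min_success_rate=0.6)
```

In this toy example the third SNP is dropped because only one of its three genotypes is observed.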

In [None]:
rowmask, colmask =  SnpArrays.filter(mouse)

In [None]:
count(rowmask), count(colmask)

In [None]:
@btime(SnpArrays.filter($mouse, min_success_rate_per_row=0.999, min_success_rate_per_col=0.999));

One may use the `rowmask` and `colmask` to filter and save the filtered result as Plink files.
```julia
SnpArrays.filter(SnpArrays.datadir("mouse"), rowmask, colmask)
```

#### Filter Plink files

Filter a set of Plink files according to row indices and column indices. By default, the filtered Plink files are saved as `srcname.filtered.bed`, `srcname.filtered.fam`, and `srcname.filtered.bim`, where `srcname` is the source Plink file name. You can also specify the destination file name using the keyword `des`.

In [None]:
SnpArrays.filter(SnpArrays.datadir("mouse"), 1:5, 1:5)

In [None]:
# clean up
rm(SnpArrays.datadir("mouse.filtered.bed"), force=true)
rm(SnpArrays.datadir("mouse.filtered.fam"), force=true)
rm(SnpArrays.datadir("mouse.filtered.bim"), force=true)

Filter a set of Plink files according to logical vectors.

In [None]:
SnpArrays.filter(SnpArrays.datadir("mouse"), rowmask, colmask)

In [None]:
readdir(glob"mouse.filtered.*", datapath)

In [None]:
# clean up
rm(SnpArrays.datadir("mouse.filtered.bed"), force=true)
rm(SnpArrays.datadir("mouse.filtered.fam"), force=true)
rm(SnpArrays.datadir("mouse.filtered.bim"), force=true)

### Concatenating `SnpArray`s

Concatenation of `SnpArray`s is implemented in `hcat`, `vcat`, and `hvcat` functions. By default, the resulting `.bed` file is saved as a file beginning with `tmp_` in the working directory. You can specify destination using keyword `des`. 

For concatenation, `SnpArray` arguments do not deal with `.fam` or `.bim` files at all. You can use `SnpData` as the arguments to create those files (see below).

In [None]:
s = SnpArrays.filter(SnpArrays.datadir("mouse"), 1:2, 1:3)
s

In [None]:
all(s .== [[0x02 0x02 0x02];
[0x02 0x02 0x03]])

Standard concatenation works just as it does for any other array. However, a temporary file is created as a side effect.

In [None]:
[s s s]

In [None]:
[s; s; s]

In [None]:
[s s s; s s s]

In [None]:
readdir(glob"tmp_*", ".")

In order to set the destination `.bed` file, you can add the keyword argument `des`.

In [None]:
hcat(s, s, s; des=SnpArrays.datadir("mouse.test.hcat"))

In [None]:
vcat(s, s, s; des=SnpArrays.datadir("mouse.test.vcat"))

In [None]:
hvcat((3, 3), s, s, s, s, s, s; des=SnpArrays.datadir("mouse.test.hvcat"))

In [None]:
# clean up
rm(SnpArrays.datadir("mouse.filtered.bed"), force=true)
rm(SnpArrays.datadir("mouse.filtered.fam"), force=true)
rm(SnpArrays.datadir("mouse.filtered.bim"), force=true)
tmplist = readdir(glob"tmp_*.bed", ".")
for f in tmplist
    rm(f, force=true)
end
rm(SnpArrays.datadir("mouse.test.hcat.bed"), force=true)
rm(SnpArrays.datadir("mouse.test.vcat.bed"), force=true)
rm(SnpArrays.datadir("mouse.test.hvcat.bed"), force=true)

## Linear Algebra

In some applications we want to perform linear algebra using a SnpArray directly, without expanding it to a numeric matrix. This is achieved through three different `struct`s:

1. Direct operations on a plink-formatted `SnpArray`: `SnpLinAlg`
2. Operations on transformed `BitMatrix`es: `SnpBitMatrix`
3. Direct operations on plink-formatted data on an Nvidia GPU: `CuSnpArray`.

`SnpLinAlg` and `SnpBitMatrix` use Chris Elrod's [LoopVectorization.jl](https://github.com/chriselrod/LoopVectorization.jl) internally. It is much faster on machines with AVX support. `CuSnpArray` uses [CUDA.jl](https://juliagpu.gitlab.io/CUDA.jl/) internally.

!!! warning "deprecated SnpBitMatrix"
    `SnpBitMatrix` is now deprecated in favor of `SnpLinAlg`. 
    `SnpBitMatrix` will be removed in the next minor release.
    
The implementation assumes that the matrix corresponding to SnpArray is the matrix of the A2 allele counts. `SnpLinAlg` and `CuSnpArray` impute any missing genotype with its column mean by default. They can also be configured to impute missing genotypes with zero. `SnpBitMatrix` can only impute missing values with zero.
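
The idea of multiplying without expanding to a numeric matrix can be sketched in plain Julia: each column needs only a four-entry lookup from code to value. The function `snp_matvec` is a hypothetical illustration on a dense matrix of codes, with missing genotypes replaced by the mean of the observed A2 counts in the column; it does not reflect the vectorized implementation in `SnpLinAlg`.

```julia
# Hypothetical sketch: y = A * x, where A holds A2 allele counts decoded on the fly
# from genotype codes, with missing genotypes replaced by the column mean.
function snp_matvec(G::AbstractMatrix{UInt8}, x::AbstractVector{<:Real})
    m, n = size(G)
    y = zeros(m)
    for j in 1:n
        col = view(G, :, j)
        nobs = count(!=(0x01), col)
        a2 = count(==(0x02), col) + 2 * count(==(0x03), col)
        μ = nobs == 0 ? 0.0 : a2 / nobs          # mean A2 count among observed genotypes
        lookup = (0.0, μ, 1.0, 2.0)              # values for codes 0x00, 0x01, 0x02, 0x03
        for i in 1:m
            y[i] += lookup[col[i] + 1] * x[j]
        end
    end
    return y
end

G = UInt8[0x00 0x03;
          0x02 0x01;
          0x03 0x03]
y = snp_matvec(G, [1.0, 2.0])
```

The missing entry in the second column is treated as 2.0, the mean of the two observed counts in that column.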

### Constructor

First let's load a data set without missing genotypes.

In [None]:
const EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"))

To instantiate a SnpLinAlg based on SnpArray,

In [None]:
const EURsla = SnpLinAlg{Float64}(EUR, model=ADDITIVE_MODEL, center=true, scale=true)
const EURsla_ = SnpLinAlg{Float64}(EUR, model=ADDITIVE_MODEL, center=true, scale=true, impute=false)

The constructor shares the same keyword arguments as the `convert` or `copyto!` functions. The type parameter, `Float64` in this example, indicates the SnpLinAlg acts like a Float64 matrix.
SnpLinAlg directly uses the SnpArray for computation and does not expand it into a full numeric array. 

In [None]:
Base.summarysize(EUR), Base.summarysize(EURsla)

### `mul!`

A SnpLinAlg acts similarly to a regular matrix and responds to `size`, `eltype`, SnpLinAlg-vector multiplication, and SnpLinAlg-matrix multiplication. Other linear algebra operations (e.g. `qr()`) should work on a SnpLinAlg, but will be much slower.

In [None]:
@show size(EURsla)
@show eltype(EURsla)
@show typeof(EURsla) <: AbstractMatrix;

Matrix-vector and matrix-matrix multiplications with SnpLinAlg are mathematically equivalent to those with the corresponding Float matrix obtained by applying `convert` or `copyto!` to a SnpArray.

In [None]:
using LinearAlgebra
v1 = randn(size(EUR, 1))
v2 = randn(size(EUR, 2))
A = convert(Matrix{Float64}, EUR, model=ADDITIVE_MODEL, center=true, scale=true);

In [None]:
norm(EURsla * v2 - A * v2)

In [None]:
norm(EURsla' * v1 - A' * v1)

### Linear Algebra Performance

See Linear Algebra Benchmarks on the left for performance comparison among BLAS, SnpLinAlg, and CuSnpArray (for GPU). In general,
+ SnpLinAlg-vector multiplications are at least 2x faster than the corresponding Matrix{Float64}-vector multiplication using BLAS
+ CuSnpArray-vector multiplications on the GPU are up to 50x faster than BLAS, and
+ SnpLinAlg-matrix multiplication is competitive with BLAS if the right hand matrix is "tall and thin".

Note that SnpLinAlg does not allocate additional memory, and can impute missing values with column means. 

### `copyto!`, `convert`, and `subarrays`

`copyto!` and `convert` are also supported on `SnpLinAlg`s, but without the `impute`, `scale`, `center` keyword arguments. The destination array will be scaled/centered if the `SnpLinAlg` was scaled/centered. 

In [None]:
# convert work on SnpLinAlg (and subarrays of it)
Atrue = convert(Matrix{Float64}, EUR, center=true, scale=true, impute=true)
A = convert(Matrix{Float64}, EURsla)
all(Atrue .≈ A)

In [None]:
# copyto on a subarray
v = zeros(size(EUR, 1), 10)
copyto!(v, @view(EURsla[:, 1:2:20]))

In [None]:
all(v .≈ Atrue[:, 1:2:20])

### GPU support: CuSnpArray (optional)

On machines with an Nvidia GPU, matrix-vector multiplications can be performed on the GPU via CuSnpArray. The input vectors should be CuVectors.

In [None]:
using CUDA, Adapt
out1 = randn(size(EUR, 1))
out2 = randn(size(EUR, 2))
v1 = randn(size(EUR, 1))
v2 = randn(size(EUR, 2))
v1_d = adapt(CuVector{Float64}, v1) # sends data to GPU
v2_d = adapt(CuVector{Float64}, v2)
out1_d = adapt(CuVector{Float64}, out1)
out2_d = adapt(CuVector{Float64}, out2)

const EURcu = CuSnpArray{Float64}(EUR; model=ADDITIVE_MODEL, center=true, scale=true);

In [None]:
@btime mul!($out1_d, $EURcu, $v2_d);

In [None]:
@btime mul!($out2_d, transpose($EURcu), $v1_d);

The operations are parallelized along the output dimension, hence the GPU was not fully utilized in the first case. With data 100 times larger, 30- to 50-fold speedups were observed in both cases on an Nvidia Titan V. See the linear algebra page for more information.

Let's check correctness of the result.

## SnpData

We can create a `SnpData`, which has a `SnpArray` with information on SNP and subject appended.

### Constructor

In [None]:
EUR_data = SnpData(SnpArrays.datadir("EUR_subset"))

### Filter

We can filter SnpData by functions `f_person` and `f_snp`. `f_person` applies to the field `person_info` and selects persons (rows) for which `f_person` is `true`. `f_snp` applies to the field `snp_info` and selects SNPs (columns) for which `f_snp` is `true`. The first argument can be either a `SnpData` or an `AbstractString`.

In [None]:
SnpArrays.filter(EUR_data; des="tmp.filter.chr.17", f_snp = x -> x[:chromosome]=="17")

In [None]:
SnpArrays.filter(SnpArrays.datadir("EUR_subset"); des="tmp.filter.chr.17", f_snp = x -> x[:chromosome]=="17")

In [None]:
SnpArrays.filter(EUR_data; des="tmp.filter.sex.male", f_person = x -> x[:sex] == "1")

Both `f_person` and `f_snp` can be used at the same time.

In [None]:
SnpArrays.filter(EUR_data; des="tmp.filter.chr.17.sex.male", f_person = x -> x[:sex] == "1", f_snp = x -> x[:chromosome] == "17")

### Split

We can split `SnpData` by the SNPs' chromosomes or by each person's sex or phenotype using `split_plink`. Again, the first argument can be a `SnpData` or an `AbstractString`.

In [None]:
splitted = SnpArrays.split_plink(SnpArrays.datadir("EUR_subset"), :chromosome; prefix="tmp.split.chr.")

Let's take the `SnpData` for chromosome 17.

In [None]:
piece = splitted["17"]

In [None]:
@assert all(piece.snp_info[!, :chromosome].== "17")

In [None]:
splitted_sex = SnpArrays.split_plink(EUR_data, :sex; prefix="tmp.split.sex.")

### Concatenation

`hcat`, `vcat`, and `hvcat` are also implemented for `SnpData`. All of the `.bed`, `.bim`, and `.fam` files are created. Simple concatenation expressions can be used (with the side effect of creating temporary plink files). One may also set the destination using the keyword argument `des`.

In [None]:
[piece piece]

In [None]:
[piece; piece]

In [None]:
[piece piece; piece piece]

In [None]:
hcat(piece, piece; des="tmp.hcat")

In [None]:
vcat(piece, piece; des="tmp.vcat")

In [None]:
hvcat((2,2), piece, piece, piece, piece; des="tmp.hvcat")

### Merge

We can merge the split dictionary back into one SnpData using `merge_plink`.

In [None]:
merged = SnpArrays.merge_plink("tmp.merged", splitted) # write_plink is included here

You can also merge the plink formatted files based on their common prefix.

In [None]:
merged_from_splitted_files = merge_plink("tmp.split.chr"; des = "tmp.merged.2")

### Reorder

Order of subjects can be changed using the function `reorder!`.

In [None]:
const mouse_prefix = SnpArrays.datadir("mouse")
# copy the mouse files so that the originals are left untouched
cp(mouse_prefix * ".bed", "mouse_reorder.bed"; force=true)
cp(mouse_prefix * ".bim", "mouse_reorder.bim"; force=true)
cp(mouse_prefix * ".fam", "mouse_reorder.fam"; force=true)

In [None]:
mouse_data = SnpData(mouse_prefix)
mouse_toreorder = SnpData("mouse_reorder", "r+")
m, n = size(mouse_toreorder.snparray)

For example, the below randomly permutes subjects.

In [None]:
using Random
ind = randperm(m)
SnpArrays.reorder!(mouse_toreorder, ind)

In [None]:
mouse_toreorder

This functionality mainly targets Cox regression, where sorting subjects in decreasing order of (censored) survival time results in more efficient implementation.
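
As a small illustration of that use case, the index vector for such a sort can be built with `sortperm`. The survival times below are made up for the example and are not part of the mouse data; the names `survival_time` and `order` are hypothetical.

```julia
# Hypothetical survival times for five subjects (not taken from the mouse data).
survival_time = [12.0, 3.5, 20.1, 7.2, 15.0]

# Permutation that lists subjects from longest to shortest survival time;
# a vector like this could be passed as the index argument to `reorder!`.
order = sortperm(survival_time; rev=true)
sorted_times = survival_time[order]
```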

## VCF to PLINK

SnpArrays.jl includes a function to transform a (gzipped) VCF file to PLINK-formatted files. This function drops multi-allelic variants and variants with missing identifier.

In [None]:
# Download an example VCF file
isfile("test.08Jun17.d8b.vcf.gz") || download("http://faculty.washington.edu/browning/beagle/test.08Jun17.d8b.vcf.gz", 
    joinpath(pwd(), "test.08Jun17.d8b.vcf.gz"));

In [None]:
vcf2plink("test.08Jun17.d8b.vcf.gz", "test.08Jun17.d8b")

In [None]:
# clean up
for ft in ["bim", "fam", "bed"]
    rm("tmp.filter.chr.17." * ft, force=true)
    rm("tmp.filter.sex.male." * ft, force=true)
    rm("tmp.filter.chr.17.sex.male." * ft, force=true)
    for k in keys(splitted)
        rm("tmp.split.chr.$(k)." * ft, force=true)
    end
    for k in keys(splitted_sex)
        rm("tmp.split.sex.$(k)." * ft, force=true)
    end
    rm("tmp.merged." * ft, force=true)
    rm("tmp.merged.2." * ft, force=true)
    
    rm("tmp.hcat." * ft, force=true)
    rm("tmp.vcat." * ft, force=true)
    rm("tmp.hvcat." * ft, force=true)

    tmplist = glob("tmp_*" * ft)
    for f in tmplist
        rm(f, force=true)
    end
end
tmplist = readdir(glob"tmp_*.bed", ".")
for f in tmplist
    rm(f, force=true)
end
rm("mouse_reorder.bim", force=true)
rm("mouse_reorder.bed", force=true)
rm("mouse_reorder.fam", force=true)
rm("mouse_reorder.reordered.fam", force=true)
rm("test.08Jun17.d8b.vcf.gz", force=true)
rm("test.08Jun17.d8b.bed", force=true)
rm("test.08Jun17.d8b.bim", force=true)
rm("test.08Jun17.d8b.fam", force=true)