# SnpArrays.jl

The module `SnpArrays` implements the `SnpArray` type for handling biallelic genotypes. `SnpArray` is an array of `Tuple{Bool,Bool}`. If `A` is the major allele and `a` the minor allele, the coding rule is  

| Genotype | Plink | SnpArray |  
|---|---|---|  
| AA | 11 | 00 |  
| Aa | 01 | 10 |  
| aa | 00 | 11 |  
| missing | 10 | 01 |  

`SnpArray` inverts the bits in the Plink binary representation. Roughly speaking, `true` represents a minor allele, except that code `01 = (false,true)` is reserved for missing genotypes.
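
The coding rule can be restated as a small sketch. The two helper functions below are hypothetical and are not part of `SnpArrays`; they only reproduce the table in plain Julia (1.x syntax, which is an assumption, since this notebook was run on an older release).

```julia
# Hypothetical helpers, not part of SnpArrays: they only restate the table above.

# Plink stores two bits per genotype; SnpArray flips both of them.
plink_to_snparray(b1::Bool, b2::Bool) = (!b1, !b2)

# Minor allele count encoded by a SnpArray entry; NaN marks a missing genotype.
function minor_allele_count(g::Tuple{Bool,Bool})
    g == (false, true) && return NaN   # reserved missing code
    return Float64(g[1] + g[2])
end

# Walk through the four Plink codes: AA, Aa, aa, missing.
for plink in [(true, true), (false, true), (false, false), (true, false)]
    g = plink_to_snparray(plink...)
    println(plink, " -> ", g, " -> ", minor_allele_count(g))
end
```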

## Initialization

`SnpArray` can be initialized from [Plink binary files](http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml), for example a data set of the MAP4 gene on chromosome 3:

In [1]:
;ls -al chr3-map4-geno.*

-rw-r--r--  1 hzhou3  staff  215043 Jul 16  2014 chr3-map4-geno.bed
-rw-r--r--  1 hzhou3  staff   25088 Jul 16  2014 chr3-map4-geno.bim
-rw-r--r--  1 hzhou3  staff   39356 Jul 16  2014 chr3-map4-geno.fam
-rw-r--r--  1 hzhou3  staff    2321 Jul 16  2014 chr3-map4-geno.log


In [2]:
include("../src/SnpArrays.jl")
using SnpArrays
map4 = SnpArray("chr3-map4-geno")

959x896 SnpArrays.SnpArray{2}:
 (false,false)  (false,false)  (false,false)  …  (true,false)   (false,false)
 (true,false)   (false,false)  (false,false)     (true,false)   (false,false)
 (false,false)  (false,false)  (false,false)     (false,false)  (false,false)
 (true,false)   (false,false)  (false,false)     (true,false)   (false,false)
 (false,false)  (false,false)  (false,false)     (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)  …  (false,false)  (false,false)
 (true,false)   (false,false)  (false,false)     (true,true)    (false,false)
 (false,false)  (false,false)  (false,false)     (true,false)   (false,false)
 (false,false)  (false,false)  (true,false)      (true,false)   (false,false)
 (false,false)  (false,false)  (false,false)     (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)  …  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)     (false,false)  (false,false)
 (true,false)   (false,false)  (f

In [3]:
size(map4)

(959,896)

Internally `SnpArray` stores two `BitArray`s `A1` and `A2`. Each genotype therefore takes 2 bits, so for $n$ individuals and $p$ SNPs the memory usage of `SnpArray` is $0.25np$ bytes, approximately the same size as the Plink `bed` file.

In [4]:
# fields of a SnpArray
fieldnames(map4)

2-element Array{Symbol,1}:
 :A1
 :A2

In [5]:
# memory usage
Base.summarysize(map4)

214896
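
As a quick check of the arithmetic, two bits per genotype work out to $np/4$ bytes. The snippet below is plain Julia and uses only the dimensions reported by `size(map4)`; the variable names are illustrative. The 80 bytes that separate its result from the `summarysize` value are presumably object overhead, which is an assumption rather than something this notebook verifies.

```julia
n, p = 959, 896            # individuals and SNPs in the MAP4 example
bits_per_genotype = 2      # one bit in A1, one bit in A2
payload_bytes = n * p * bits_per_genotype ÷ 8
println(payload_bytes)     # 214816
```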

Alternatively we can initialize `SnpArray` from a matrix of minor allele counts.

In [6]:
SnpArray(rand(0:2, 5, 3))

5x3 SnpArrays.SnpArray{2}:
 (false,false)  (false,false)  (true,true)  
 (false,false)  (true,true)    (true,true)  
 (true,true)    (true,true)    (true,false) 
 (true,true)    (false,false)  (true,true)  
 (true,true)    (false,false)  (false,false)

## Summary statistics

`summarysnps()` computes the number of minor alleles, the number of missing genotypes, and the minor allele frequency (MAF) along each row and column. The MAF calculation takes missingness into account.

In [7]:
nmialcol, nmisscol, mafcol, nmialrow, nmissrow, mafrow = summarysnps(map4)
mafcol

896-element Array{Float64,1}:
 0.226799  
 0.00208551
 0.0260688 
 0.0135558 
 0.00104275
 0.397704  
 0.0182482 
 0.00312826
 0.00417537
 0.00834202
 0.0630865 
 0.00104275
 0.00573514
 ⋮         
 0.00104275
 0.00156413
 0.00104384
 0.00104275
 0.00521921
 0.00782065
 0.00208551
 0.00208551
 0.00208551
 0.00260688
 0.382046  
 0.00260688

In [8]:
# total number of missing genotypes
sum(nmisscol), sum(nmissrow)

(218,218)
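
One way to account for missingness in a MAF is to divide the minor allele count by twice the number of non-missing genotypes. The printed `mafcol` values are consistent with this reading (for instance $8 / (2 \cdot 958) \approx 0.00417537$), but the helper below is a hypothetical sketch in plain Julia, not the code behind `summarysnps()`. It works on a vector of minor allele counts with `NaN` for missing entries and does not swap alleles when the frequency exceeds 0.5.

```julia
# Hypothetical helper, not the package's summarysnps: MAF of one SNP from
# minor allele counts (0, 1, 2), with NaN marking a missing genotype.
function maf_with_missing(counts::AbstractVector{<:Real})
    observed = filter(!isnan, counts)
    isempty(observed) && return NaN
    # each observed individual carries two alleles
    return sum(observed) / (2 * length(observed))
end

maf_with_missing([0, 1, 2, NaN, 0])   # 3 minor alleles / 8 observed alleles = 0.375
```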

## Subsetting

Subsetting a `SnpArray` works the same way as subsetting any other array.

In [9]:
# genotypes of the 1st individual
map4[1, :]

1x896 SnpArrays.SnpArray{2}:
 (false,false)  (false,false)  (false,false)  …  (true,false)  (false,false)

In [10]:
# genotypes of the 5th SNP
map4[:, 5]

959-element SnpArrays.SnpArray{1}:
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 ⋮            
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)
 (false,false)

In [11]:
# subsetting both individuals and SNPs
map4[1:5, 5:10]

5x6 SnpArrays.SnpArray{2}:
 (false,false)  (true,false)   (false,false)  …  (false,false)  (false,false)
 (false,false)  (true,false)   (false,false)     (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)     (false,false)  (false,false)
 (false,false)  (true,false)   (false,false)     (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)     (false,false)  (false,false)

In [12]:
# filter out rare SNPs with MAF < 0.05
map4[:, mafcol .>= 0.05]

959x150 SnpArrays.SnpArray{2}:
 (false,false)  (true,false)   (false,false)  …  (true,false)   (true,false) 
 (true,false)   (true,false)   (true,false)      (true,false)   (true,false) 
 (false,false)  (false,false)  (false,false)     (false,false)  (false,false)
 (true,false)   (true,false)   (false,false)     (true,false)   (true,false) 
 (false,false)  (false,false)  (true,false)      (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)  …  (false,false)  (false,false)
 (true,false)   (true,true)    (false,false)     (true,true)    (true,true)  
 (false,false)  (false,false)  (false,false)     (false,false)  (true,false) 
 (false,false)  (true,true)    (false,false)     (true,true)    (true,false) 
 (false,false)  (true,false)   (false,false)     (true,false)   (false,false)
 (false,false)  (false,false)  (false,false)  …  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)     (false,false)  (false,false)
 (true,false)   (true,false)   (f

In [13]:
# filter out individuals with genotyping success rate < 0.999
map4[nmissrow / size(map4, 2) .< 0.001, :]

817x896 SnpArrays.SnpArray{2}:
 (false,false)  (false,false)  (false,false)  …  (true,false)   (false,false)
 (true,false)   (false,false)  (false,false)     (true,false)   (false,false)
 (false,false)  (false,false)  (false,false)     (false,false)  (false,false)
 (true,false)   (false,false)  (false,false)     (true,false)   (false,false)
 (false,false)  (false,false)  (false,false)     (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)  …  (false,false)  (false,false)
 (true,false)   (false,false)  (false,false)     (true,true)    (false,false)
 (false,false)  (false,false)  (false,false)     (true,false)   (false,false)
 (false,false)  (false,false)  (true,false)      (true,false)   (false,false)
 (false,false)  (false,false)  (false,false)     (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)  …  (false,false)  (false,false)
 (true,false)   (false,false)  (false,false)     (true,false)   (false,false)
 (true,true)    (false,false)  (f

`sub()` and `slice()` create views of a subarray without copying data, which improves efficiency in many calculations.

In [14]:
_, _, mafcommon = summarysnps(sub(map4, :, mafcol .>= 0.05))
mafcommon

150-element Array{Float64,1}:
 0.226799 
 0.397704 
 0.0630865
 0.206465 
 0.155892 
 0.226799 
 0.37122  
 0.38634  
 0.156054 
 0.122523 
 0.123566 
 0.0866388
 0.123045 
 ⋮        
 0.390167 
 0.390511 
 0.409424 
 0.354015 
 0.411366 
 0.366771 
 0.390511 
 0.391032 
 0.395725 
 0.354015 
 0.390511 
 0.382046 
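
The difference between a view and a copy can be seen on an ordinary matrix. In Julia releases later than the one used in this notebook, `sub` and `slice` were replaced by `view`, so the sketch below uses `view`; the toy matrix and variable names are illustrative only, and the snippet does not involve `SnpArray`.

```julia
X = [0.0 1.0 2.0; 1.0 0.0 2.0; 0.0 0.0 1.0]   # toy 3x3 genotype matrix
keep = [true, false, true]                     # columns to keep

V = view(X, :, keep)    # no copy: V shares memory with X
C = X[:, keep]          # ordinary indexing makes a copy

X[1, 3] = 9.0
println(V[1, 2])        # 9.0, the view sees the change
println(C[1, 2])        # 2.0, the copy does not
```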

## Assignment

It is possible to assign specific genotypes to a `SnpArray` entry.

In [15]:
map4[1, 1]

(false,false)

In [16]:
map4[1, 1] = (true, false)
map4[1, 1]

(true,false)

In [17]:
map4[1, 1] = NaN
map4[1, 1]

(false,true)

In [18]:
map4[1, 1] = 0
map4[1, 1]

(false,false)

Subsetted assignment such as `map4[:, 1] = NaN` is also valid.

## Copy, convert, and imputation

In some cases we convert a whole `SnpArray` to a numeric array for computations such as PCA and Lasso. Keep in mind that the resulting array can take up to 32 times the storage of the `SnpArray`. Below are storage estimates for some common data types. Here MAF denotes the average minor allele frequency.

* `SnpArray`: $0.25np$ bytes  
* `Matrix{Int8}`: $np$ bytes  
* `Matrix{Float32}`: $4np$ bytes  
* `Matrix{Float64}`: $8np$ bytes  
* `SparseMatrixCSC{Float64, Int64}`: $24 \cdot NNZ \approx 24np(2MAF(1-MAF)+MAF^2) = 24np \cdot MAF(2-MAF)$ bytes. When MAF=0.25, it is $10.5np$ bytes. When MAF=0.017, it is $0.8np$ bytes, 10 fold smaller than `Matrix{Float64}` type.  
* `SparseMatrixCSC{Bool, Int64}`: $2np \cdot MAF \cdot 17 = 34 np \cdot MAF$ bytes. When average MAF=0.25, it is $8.5np$ bytes. When MAF=0.0235, it is $0.8np$ bytes, 10 fold smaller than `Matrix{Float64}` type.  

To be concrete, consider 2 typical data sets:  
* COPD (GWAS): $n = 6670$ individuals, $p = 630998$ SNPs, average MAF is 0.2454.
* GAW19 (sequencing study): $n = 1943$ individuals, $p = 1711766$ SNPs, average MAF is 0.00499.  

| Data Type | COPD | GAW19 |  
|---|---|---|  
| `SnpArray` | 1.05GB | 0.83GB |  
| `Matrix{Float64}` | 33.67GB | 26.61GB |  
| `SparseMatrixCSC{Float64, Int64}` | 43.5GB | 0.795GB |  
| `SparseMatrixCSC{Bool, Int64}` | 35.1GB |0.564GB |  

Evidently, for data sets dominated by rare variants, converting to sparse matrices saves memory and often brings computational advantages too.
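
The entries of the table follow directly from the formulas in the list above. The function names below are hypothetical and only restate those formulas in plain Julia, with sizes reported in units of $10^9$ bytes.

```julia
# Storage estimates in bytes, restating the formulas listed above.
snparray_bytes(n, p)         = 0.25 * n * p
dense_f64_bytes(n, p)        = 8.0 * n * p
sparse_f64_bytes(n, p, maf)  = 24.0 * n * p * maf * (2 - maf)
sparse_bool_bytes(n, p, maf) = 34.0 * n * p * maf

gb(x) = round(x / 1e9, digits = 3)   # bytes to GB

# (name, individuals, SNPs, average MAF) as given for the two data sets
for (name, n, p, maf) in [("COPD", 6670, 630998, 0.2454),
                          ("GAW19", 1943, 1711766, 0.00499)]
    println(name, ": ",
            gb(snparray_bytes(n, p)), " ",
            gb(dense_f64_bytes(n, p)), " ",
            gb(sparse_f64_bytes(n, p, maf)), " ",
            gb(sparse_bool_bytes(n, p, maf)), " GB")
end
```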

In [19]:
map4f64 = convert(Matrix{Float64}, map4)
# memory usage if convert to Float64
Base.summarysize(map4f64)

6874112

In [20]:
@show mean(mafcol)
map4f64sp = convert(SparseMatrixCSC{Float64, Int64}, map4)
# memory usage if convert to sparse Float64 matrix
Base.summarysize(map4f64sp)

mean(mafcol) = 0.059234694204908254


1390144

By default the `convert()` method does **not** impute missing genotypes; it converts them to `NaN`.

In [21]:
# number of missing genotypes
countnz(isnan(map4)), countnz(isnan(map4f64))

(218,218)

We can enforce imputation by setting the optional argument `impute = true`. Imputation is done by generating two random alleles according to the minor allele frequency. This is not an optimal strategy, and users should make sure genotypes are imputed with high quality using other, more advanced methods.

In [22]:
map4f64impute = convert(Matrix{Float64}, map4; impute = true)
countnz(isnan(map4f64impute))

0
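
The imputation rule described above, two random alleles drawn according to the MAF, can be sketched as follows. This is a hypothetical illustration in plain Julia operating on a vector of minor allele counts; the function names and the explicit random number generator argument are assumptions, and the package's own implementation may differ.

```julia
using Random

# Hypothetical sketch, not the package's code: draw two alleles independently,
# each being the minor allele with probability `maf`.
impute_count(rng::AbstractRNG, maf::Real) = (rand(rng) < maf) + (rand(rng) < maf)

# Replace NaN entries of one SNP's count vector by random draws.
function impute!(rng::AbstractRNG, counts::Vector{Float64}, maf::Real)
    for i in eachindex(counts)
        if isnan(counts[i])
            counts[i] = impute_count(rng, maf)
        end
    end
    return counts
end

g = [0.0, NaN, 1.0, NaN, 2.0]
impute!(MersenneTwister(1), g, 0.25)   # observed entries are left untouched
```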

By default `convert()` translates genotypes according to the *additive* SNP model, which is essentially the minor allele count (0, 1, or 2). Other SNP models are *dominant* and *recessive*.

| Genotype | `SnpArray` | `model=:additive` | `model=:dominant` | `model=:recessive` |    
|---|---|---|---|---|  
| AA | 00 | 0 | 0 | 0 |  
| Aa | 10 | 1 | 0 | 2 |
| aa | 11 | 2 | 2 | 2 |
| missing | 01 | NaN | NaN | NaN | 

In [23]:
hcat(convert(Vector{Float64}, map4[1:10, 1]; model = :additive),
     convert(Vector{Float64}, map4[1:10, 1]; model = :dominant),
     convert(Vector{Float64}, map4[1:10, 1]; model = :recessive))

10x3 Array{Float64,2}:
 0.0  0.0  0.0
 1.0  0.0  2.0
 0.0  0.0  0.0
 1.0  0.0  2.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 1.0  0.0  2.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
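
The coding table can also be written as a lookup on the minor allele count. The function below is a hypothetical helper in plain Julia that simply transcribes the table above; it is not the package's `convert()` method.

```julia
# Hypothetical helper that restates the coding table above; `count` is the
# minor allele count (0, 1, 2) or NaN for a missing genotype.
function code_genotype(count::Real, model::Symbol)
    isnan(count) && return NaN
    if model == :additive
        return Float64(count)
    elseif model == :dominant
        return count == 2 ? 2.0 : 0.0
    elseif model == :recessive
        return count >= 1 ? 2.0 : 0.0
    else
        throw(ArgumentError("unknown model: $model"))
    end
end

# Rows: AA, Aa, aa, missing. Columns: additive, dominant, recessive.
[code_genotype(c, m) for c in (0, 1, 2, NaN), m in (:additive, :dominant, :recessive)]
```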

By default `convert()` does **not** center and scale genotypes. Setting optional arguments `center = true, scale = true` centers genotypes at 2MAF and scales them by $[2 \cdot MAF \cdot (1 - MAF)]^{-1/2}$. Mono-allelic SNPs (MAF=0) are not centered or scaled.

In [24]:
[convert(Vector{Float64}, map4[:, 1]) convert(Vector{Float64}, map4[:, 1]; center = true, scale = true)]

959x2 Array{Float64,2}:
 0.0  -0.76593 
 1.0   0.922637
 0.0  -0.76593 
 1.0   0.922637
 0.0  -0.76593 
 0.0  -0.76593 
 1.0   0.922637
 0.0  -0.76593 
 0.0  -0.76593 
 0.0  -0.76593 
 0.0  -0.76593 
 0.0  -0.76593 
 1.0   0.922637
 ⋮             
 0.0  -0.76593 
 0.0  -0.76593 
 0.0  -0.76593 
 0.0  -0.76593 
 1.0   0.922637
 1.0   0.922637
 1.0   0.922637
 1.0   0.922637
 1.0   0.922637
 1.0   0.922637
 0.0  -0.76593 
 0.0  -0.76593 
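
The two values in the second column can be reproduced by hand from the MAF of the first SNP printed earlier (0.226799). The snippet below is plain Julia with illustrative variable names; it applies the centering and scaling formula stated above to the genotype values 0 and 1.

```julia
maf = 0.226799                           # MAF of the first SNP, from mafcol above
shift = 2 * maf                          # centering constant, 2MAF
factor = 1 / sqrt(2 * maf * (1 - maf))   # scaling constant

standardize(g) = (g - shift) * factor

println(standardize(0.0))   # about -0.76593
println(standardize(1.0))   # about  0.92264
```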

`copy!()` is the in-place version of `convert()`. Applications such as GWAS loop over SNPs and perform a statistical analysis for each SNP. This can be achieved by

In [25]:
g = zeros(size(map4, 1))
for j = 1:size(map4, 2)
    copy!(g, map4[:, j]; model = :additive, impute = true)
    # do statistical analysis
end

## Empirical kinship

`kinship()` computes the empirical kinship matrix using either the genetic relationship matrix (`method=:GRM`, default) or the method of moments (`method=:MoM`). Missing genotypes are imputed on the fly according to minor allele frequencies.
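
To illustrate the GRM option, here is a minimal sketch assuming one commonly used definition, $\widehat\Phi = \frac{1}{2S} \sum_{k} z_k z_k^T$, where $z_k$ is the centered and scaled genotype vector of SNP $k$ and $S$ is the number of polymorphic SNPs. The function `grm_sketch` is hypothetical and works on a dense matrix of minor allele counts without missing entries; the normalization and the on-the-fly imputation used by `kinship()` may differ.

```julia
# Hypothetical sketch, not the package's kinship(): GRM-style kinship from a
# dense matrix X of minor allele counts (rows = individuals, columns = SNPs),
# assuming no missing entries.
function grm_sketch(X::AbstractMatrix{<:Real})
    n, p = size(X)
    K = zeros(n, n)
    used = 0
    for j in 1:p
        maf = sum(X[:, j]) / (2n)
        (maf == 0 || maf == 1) && continue        # skip monomorphic SNPs
        z = (X[:, j] .- 2maf) ./ sqrt(2maf * (1 - maf))
        K .+= z * z'                              # outer product of standardized genotypes
        used += 1
    end
    return K ./ (2 * used)
end

X = [0 1 2; 1 1 0; 2 0 1; 0 2 1]   # toy data: 4 individuals, 3 SNPs
K = grm_sketch(X)
```

Because every column is centered at its own mean, each row of `K` sums to zero up to rounding error.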

In [26]:
# GRM using all SNPs
kinship(map4)

959x959 Array{Float64,2}:
  0.380544    -0.001171     -0.0288245  …  -0.028186    0.0197438  
 -0.001171     0.0910801    -0.0226232     -0.0126247  -0.000119002
 -0.0288245   -0.0226232     0.641305       0.0814241  -0.0255233  
 -0.0050335    0.0150372    -0.0215443     -0.025145   -0.0039815  
 -0.024474    -0.022623      0.073076       0.142048   -0.0285776  
 -0.024404    -0.0256074     0.076275   …   0.0715203  -0.0285076  
  0.0453891    0.0370275    -0.130117      -0.128325    0.0418238  
 -0.0218192   -0.0230977     0.0714131      0.0720706  -0.0243318  
  0.032136    -0.00637036   -0.124575      -0.122782    0.016663   
 -0.00733452  -0.0426735    -0.0116094     -0.0109709  -0.0417872  
 -0.0311827   -0.0220789     0.089408   …   0.080414   -0.0266451  
 -0.0300756   -0.0226379     0.0839075      0.081521   -0.025538   
 -0.00867467   0.0113961    -0.0251854     -0.0287862  -0.00762267 
  ⋮                                     ⋱                          
 -0.0215018   -0.01965

In [27]:
# MoM using all SNPs
kinship(map4; method = :MoM)

959x959 Array{Float64,2}:
  0.140001     -0.00682793   0.0211395    …   0.0281314     0.0281314 
 -0.00682793    0.119026     0.0630907        0.0910581    -0.0138198 
  0.0211395     0.0630907    0.92309          0.874147      0.0281314 
 -0.0417872     0.0560988    0.049107         0.0421151    -0.0487791 
  0.0211395     0.0281314    0.804228         0.853171     -0.00682793
  0.0211395     0.0281314    0.81122      …   0.797236     -0.00682793
 -0.00682793   -0.00682793  -0.775933        -0.761949     -0.0417872 
  0.0630907     0.0630907    0.81122          0.818212      0.0351232 
 -0.237559     -0.321462    -0.831867        -0.817884     -0.27951   
 -0.293494     -0.342437     0.0281314        0.0351232    -0.363413  
  0.000163934   0.0560988    0.88813      …   0.860163      0.0141477 
  0.00715579    0.0560988    0.867155         0.867155      0.0211395 
 -0.0837384     0.0141477    0.00715579       0.000163934  -0.0907303 
  ⋮                                       ⋱        

In [28]:
# GRM using every other SNP
kinship(sub(map4, :, 1:2:size(map4, 2)))

959x959 Array{Float64,2}:
  0.07881      -0.000853301  -0.0235223  …  -0.0231723    0.0184407  
 -0.000853301   0.0354295    -0.0250093     -0.00734379  -0.000482591
 -0.0235223    -0.0250093     0.656679       0.0801822   -0.0209578  
 -0.00124231    0.0177249    -0.0253983     -0.0250484   -0.000871595
 -0.0204739    -0.0210335     0.0720974      0.123584    -0.0264231  
 -0.0174959    -0.0274966     0.0705212  …   0.0685633   -0.0234451  
  0.0395097     0.0410407    -0.129707      -0.127049     0.0376866  
 -0.0157941    -0.0219088     0.0735532      0.0715953   -0.0178573  
  0.0320641    -0.0128681    -0.129525      -0.126867     0.00642591 
 -0.0098777    -0.0480998    -0.0183572     -0.0180072   -0.0440483  
 -0.0189452    -0.0226958     0.0799405  …   0.0779826   -0.0186442  
 -0.0201318    -0.0238824     0.0787539      0.076796    -0.0198309  
 -0.0035757     0.0153915    -0.0277317     -0.0273818   -0.00320499 
  ⋮                                      ⋱                      

## PCA

In progress ...

In [29]:
versioninfo()

Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
