# SnpArrays.jl

The module `SnpArrays` implements the `SnpArray` type for handling biallelic genotypes. `SnpArray` is an array of `Tuple{Bool,Bool}` and adopts the same coding as the [Plink binary format](http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml). If `A1` and `A2` are the two alleles, the coding rule is  

| Genotype | SnpArray |
|---|---|
| A1,A1 | 00 |
| A1,A2 | 01 |
| A2,A2 | 11 |
| missing | 10 |

The code `10==(true,false)` is reserved for missing genotype. Otherwise, the bit `1==true` represents one copy of allele `A2`. In a two-dimensional `SnpArray`, each row is a person and each column is a SNP.
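
The coding rule can be spelled out in a few lines of plain Julia. This is a minimal sketch, not package code: `ismissinggeno` and `a2count` are hypothetical helper names, and the snippet assumes a current Julia (1.x) with nothing loaded beyond `Base`.

```julia
# Hypothetical helpers (not part of SnpArrays): decode one Plink-style bit pair.
ismissinggeno(g::Tuple{Bool,Bool}) = g == (true, false)

# Number of A2 copies, or `missing` for the reserved code (true,false).
a2count(g::Tuple{Bool,Bool}) = ismissinggeno(g) ? missing : Int(g[1]) + Int(g[2])

for g in [(false, false), (false, true), (true, true), (true, false)]
    println(g, " => ", a2count(g))
end
```

Each `true` bit contributes one copy of `A2`, so the three non-missing codes map to 0, 1 and 2 copies.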

## Constructor

There are various ways to initialize a `SnpArray`.  

* `SnpArray` can be initialized from [Plink binary files](http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml), say the sample data set hapmap3:

In [1]:
;ls -al hapmap3.*

-rw-r--r--@ 1 hzhou3  staff  1128171 Jun  4  2010 hapmap3.bed
-rw-r--r--  1 hzhou3  staff   388672 Jun  4  2010 hapmap3.bim
-rw-r--r--@ 1 hzhou3  staff     7136 Jun  4  2010 hapmap3.fam
-rw-r--r--@ 1 hzhou3  staff   332960 Jun  4  2010 hapmap3.map


In [2]:
using SnpArrays
hapmap = SnpArray("hapmap3")

324x13928 SnpArrays.SnpArray{2}:
 (true,true)  (true,true)   (false,false)  …  (true,true)   (true,true)
 (true,true)  (false,true)  (false,true)      (false,true)  (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (false,true)  (true,true)    …  (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)   …  (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 ⋮                             

In [3]:
# rows are people; columns are SNPs
people, snps = size(hapmap)

(324,13928)

Internally `SnpArray` stores data as `BitArray`s (two bits per genotype) and consumes approximately as much memory as the Plink `bed` file.

In [4]:
# memory usage
Base.summarysize(hapmap)

1128256
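
The sizes above can be checked by hand. The sketch below assumes the standard SNP-major Plink `bed` layout (3 header bytes, then 4 genotypes per byte with each SNP padded to a whole byte) and a tightly packed two-bit representation in memory. It is arithmetic only and does not call the package.

```julia
people, snps = 324, 13928          # dimensions of the hapmap3 data

# Two bits per genotype, packed without padding.
packed_bytes = people * snps * 2 ÷ 8

# Plink bed (SNP-major): 3 header bytes, then each SNP padded to whole bytes.
bed_bytes = 3 + cld(people, 4) * snps

println((packed_bytes, bed_bytes))  # (1128168, 1128171)
```

The second number equals the size of `hapmap3.bed` in the listing, and the first is within 100 bytes of the `Base.summarysize` result.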

* `SnpArray` can be initialized from a matrix of A1 allele counts.

In [5]:
SnpArray(rand(0:2, 5, 3))

5x3 SnpArrays.SnpArray{2}:
 (false,true)   (false,false)  (false,false)
 (true,true)    (false,true)   (false,false)
 (false,true)   (false,false)  (false,true) 
 (false,false)  (true,true)    (false,false)
 (false,false)  (false,false)  (true,true)  

* `SnpArray(m, n)` generates an m by n `SnpArray` of all A1 alleles.

In [6]:
s = SnpArray(5, 3)

5x3 SnpArrays.SnpArray{2}:
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)

## Summary statistics

`summarize()` computes  
* `maf`: minor allele frequencies, taking missingness into account.  
* `minor_allele`: a `BitVector` indicating the minor allele for each SNP. `minor_allele[j]==true` means A1 is the minor allele for SNP j; `minor_allele[j]==false` means A2 is the minor allele for SNP j.  
* `missings_by_snp`: number of missing genotypes for each SNP.  
* `missings_by_person`: number of missing genotypes for each person.

In [7]:
maf, minor_allele, missings_by_snp, missings_by_person = summarize(hapmap)
# minor allele frequencies
maf'

1x13928 Array{Float64,2}:
 0.0  0.0776398  0.324074  0.191589  …  0.00154321  0.0417957  0.00617284

In [8]:
# total number of missing genotypes
sum(missings_by_snp), sum(missings_by_person)

(11894,11894)
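
To make the four outputs concrete, here is a minimal sketch that computes them on a toy matrix of `Tuple{Bool,Bool}` genotypes. `toy_summarize` is a hypothetical stand-in, not the package's `summarize()`. It assumes that frequencies are computed from non-missing genotypes only and that ties (frequency exactly 0.5) are reported with A2 as the minor allele. The code targets current Julia (1.x).

```julia
# Toy 4-person x 2-SNP genotype matrix; (true,false) is the missing code.
G = [(false, false) (true, true);
     (false, true)  (true, true);
     (true, false)  (false, true);
     (true, true)   (true, false)]

# Hypothetical stand-in for summarize(); not the package implementation.
function toy_summarize(G)
    n, p = size(G)
    maf = zeros(p)
    minor_allele = falses(p)          # true when A1 is the minor allele
    missings_by_snp = zeros(Int, p)
    missings_by_person = zeros(Int, n)
    for j in 1:p
        a2 = 0                        # A2 copies among non-missing genotypes
        for i in 1:n
            g = G[i, j]
            if g == (true, false)
                missings_by_snp[j] += 1
                missings_by_person[i] += 1
            else
                a2 += g[1] + g[2]
            end
        end
        a2freq = a2 / (2 * (n - missings_by_snp[j]))
        minor_allele[j] = a2freq > 0.5    # assumption: ties count A2 as minor
        maf[j] = min(a2freq, 1 - a2freq)
    end
    return maf, minor_allele, missings_by_snp, missings_by_person
end

maf_toy, minor_toy, miss_snp, miss_person = toy_summarize(G)
println(sum(miss_snp) == sum(miss_person))   # both count the same missing entries
```

The two missing-genotype totals always agree because they count the same entries, once by column and once by row.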

## Random genotypes

`randgeno(a1freq)` generates a random genotype according to A1 allele frequency `a1freq`.

In [9]:
randgeno(0.5)

(true,true)
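
One way to picture such a draw is as two independent alleles, each being A1 with probability `a1freq`. The sketch below rests on that assumption (Hardy-Weinberg proportions) and uses a hypothetical function `toy_randgeno`. It is not the package's implementation and is written for current Julia (1.x).

```julia
using Random

# Hypothetical stand-in for randgeno(a1freq): draw two alleles independently.
# A bit is true for allele A2, so each draw yields A2 with probability 1 - a1freq.
function toy_randgeno(rng::AbstractRNG, a1freq::Real)
    b1 = rand(rng) > a1freq
    b2 = rand(rng) > a1freq
    # Order the bits so a heterozygote is (false,true), never the missing code.
    return (b1 & b2, b1 | b2)
end

rng = MersenneTwister(123)
println(toy_randgeno(rng, 0.5))
```

Combining the two bits with `&` and `|` guarantees that the reserved code `(true,false)` is never produced.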

`randgeno(maf, minor_allele)` generates a random genotype according to minor allele frequency `maf` and whether the minor allele is A1 (`minor_allele==true`) or A2 (`minor_allele==false`).

In [10]:
randgeno(0.25, true)

(false,false)

`randgeno(n, maf, minor_allele)` generates a vector of random genotypes according to a common minor allele frequency `maf` and the minor allele.

In [11]:
randgeno(10, 0.25, true)

10-element SnpArrays.SnpArray{1}:
 (true,true) 
 (false,true)
 (true,true) 
 (false,true)
 (false,true)
 (false,true)
 (false,true)
 (true,true) 
 (true,true) 
 (true,true) 

`randgeno(m, n, maf, minor_allele)` generates a random $m$-by-$n$ `SnpArray` according to a vector of minor allele frequencies `maf` and a minor allele indicator vector. The lengths of both vectors should be `n`.

In [12]:
# this is a random replicate of the hapmap data
randgeno(size(hapmap), maf, minor_allele)

324x13928 SnpArrays.SnpArray{2}:
 (true,true)  (true,true)   (false,true)  …  (true,true)   (true,true) 
 (true,true)  (true,true)   (true,true)      (true,true)   (true,true) 
 (true,true)  (true,true)   (true,true)      (true,true)   (true,true) 
 (true,true)  (true,true)   (true,true)      (true,true)   (true,true) 
 (true,true)  (false,true)  (false,true)     (true,true)   (true,true) 
 (true,true)  (true,true)   (true,true)   …  (true,true)   (true,true) 
 (true,true)  (true,true)   (true,true)      (true,true)   (true,true) 
 (true,true)  (true,true)   (true,true)      (true,true)   (true,true) 
 (true,true)  (true,true)   (true,true)      (true,true)   (true,true) 
 (true,true)  (true,true)   (false,true)     (true,true)   (true,true) 
 (true,true)  (true,true)   (false,true)  …  (true,true)   (true,true) 
 (true,true)  (true,true)   (false,true)     (true,true)   (true,true) 
 (true,true)  (true,true)   (false,true)     (true,true)   (true,true) 
 ⋮                             

## Subsetting

Subsetting a `SnpArray` works the same way as subsetting any other array.

In [13]:
# genotypes of the 1st person
hapmap[1, :]

1x13928 SnpArrays.SnpArray{2}:
 (true,true)  (true,true)  (false,false)  …  (true,true)  (true,true)

In [14]:
# genotypes of the 5th SNP
hapmap[:, 5]

324-element SnpArrays.SnpArray{1}:
 (true,true)  
 (true,true)  
 (false,true) 
 (false,true) 
 (true,true)  
 (false,false)
 (false,false)
 (true,true)  
 (true,true)  
 (true,true)  
 (true,true)  
 (true,true)  
 (false,true) 
 ⋮            
 (false,false)
 (true,true)  
 (false,true) 
 (true,true)  
 (true,true)  
 (true,true)  
 (true,true)  
 (true,true)  
 (false,true) 
 (true,true)  
 (true,true)  
 (true,true)  

In [15]:
# subsetting both persons and SNPs
hapmap[1:5, 5:10]

5x6 SnpArrays.SnpArray{2}:
 (true,true)   (true,true)  (false,true)  …  (true,true)   (false,true)
 (true,true)   (true,true)  (true,true)      (true,true)   (false,true)
 (false,true)  (true,true)  (true,true)      (false,true)  (true,true) 
 (false,true)  (true,true)  (true,true)      (true,true)   (false,true)
 (true,true)   (true,true)  (true,true)      (true,true)   (false,true)

In [16]:
# filter out rare SNPs with MAF < 0.05
hapmap[:, maf .≥ 0.05]

324x12085 SnpArrays.SnpArray{2}:
 (true,true)   (false,false)  (true,true)   …  (false,true)  (false,true)
 (false,true)  (false,true)   (false,true)     (true,true)   (true,true) 
 (true,true)   (false,true)   (false,true)     (true,true)   (true,true) 
 (true,true)   (false,true)   (true,true)      (false,true)  (false,true)
 (true,true)   (false,true)   (false,true)     (true,true)   (true,true) 
 (false,true)  (true,true)    (true,true)   …  (false,true)  (false,true)
 (true,true)   (true,true)    (true,true)      (true,true)   (true,true) 
 (true,true)   (false,false)  (true,true)      (true,true)   (true,true) 
 (true,true)   (false,true)   (false,true)     (true,true)   (true,true) 
 (true,true)   (false,true)   (true,true)      (false,true)  (false,true)
 (true,true)   (false,true)   (true,true)   …  (true,true)   (true,true) 
 (true,true)   (true,true)    (false,true)     (false,true)  (false,true)
 (true,true)   (false,false)  (true,true)      (false,true)  (false,true)
 ⋮   

In [17]:
# filter out individuals with genotyping success rate < 0.90
hapmap[missings_by_person / people .< 0.1, :]

220x13928 SnpArrays.SnpArray{2}:
 (true,true)  (true,true)   (false,false)  …  (true,true)   (true,true)
 (true,true)  (false,true)  (false,true)      (false,true)  (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)   …  (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (false,true)  (false,true)   …  (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 ⋮                             

`sub()` and `slice()` create views of a subarray without copying data, which improves efficiency in many calculations.

In [18]:
mafcommon, = summarize(sub(hapmap, :, maf .≥ 0.05))
mafcommon'

1x12085 Array{Float64,2}:
 0.0776398  0.324074  0.191589  …  0.310937  0.23913  0.23913  0.23913
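
The difference between a copy and a view can be seen on an ordinary matrix. Note that the cells in this document use the Julia 0.4 function `sub()`. From Julia 0.5 onwards the equivalent Base function is `view()`, which is what this sketch uses. It assumes current Julia (1.x) and does not involve `SnpArray`.

```julia
A = collect(reshape(1:12, 3, 4))
mask = [true, false, true, false]

copied = A[:, mask]          # indexing allocates a new 3x2 matrix
viewed = view(A, :, mask)    # a SubArray that refers back to A

A[1, 1] = 100
println(copied[1, 1])        # 1: the copy kept the old value
println(viewed[1, 1])        # 100: the view sees the change to A
```

Because the view shares memory with its parent, no genotype data would be duplicated when summary statistics are computed on a subset.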

## Assignment

It is possible to assign specific genotypes to a `SnpArray` entry.

In [19]:
hapmap[1, 1]

(true,true)

In [20]:
hapmap[1, 1] = (false, true)
hapmap[1, 1]

(false,true)

In [21]:
hapmap[1, 1] = NaN
hapmap[1, 1]

(true,false)

In [22]:
hapmap[1, 1] = 2
hapmap[1, 1]

(true,true)

Subsetted assignment such as `hapmap[:, 1] = NaN` is also valid.

## Copy, convert, and imputation

In most analyses we convert a whole `SnpArray` or slices of it to numeric arrays for computational purposes. Keep in mind that the resulting data can take up to 32 times more storage than the original `SnpArray`. Fortunately, the rich collection of data types in `Julia` lets us choose one that fits into memory. Below are estimates of memory usage for some common data types with `n` persons and `p` SNPs. Here MAF denotes the **average** minor allele frequency.

* `SnpArray`: $0.25np$ bytes  
* `Matrix{Int8}`: $np$ bytes  
* `Matrix{Float16}`: $2np$ bytes  
* `Matrix{Float32}`: $4np$ bytes  
* `Matrix{Float64}`: $8np$ bytes  
* `SparseMatrixCSC{Float64,Int64}`: $16 \cdot \text{NNZ} + 8(p+1) \approx 16np(2\text{MAF}(1-\text{MAF})+\text{MAF}^2) + 8(p+1) = 16np \cdot \text{MAF}(2-\text{MAF}) + 8(p+1)$ bytes. When the average MAF=0.25, this is about $7np$ bytes. When MAF=0.025, this is about $0.8np$ bytes, 10-fold smaller than the `Matrix{Float64}` type.  
* `SparseMatrixCSC{Int8,UInt32}`: $5 \cdot \text{NNZ} + 4(p+1) \approx 5np(2\text{MAF}(1-\text{MAF})+\text{MAF}^2) + 4(p+1) = 5np \cdot \text{MAF}(2-\text{MAF}) + 4(p+1)$ bytes. When the average MAF=0.25, this is about $2.2np$ bytes. When MAF=0.08, this is about $0.8np$ bytes, 10-fold smaller than the `Matrix{Float64}` type.  
* Two `SparseMatrixCSC{Bool,Int64}`: $2np \cdot \text{MAF} \cdot 9 + 16(p+1) = 18 np \cdot \text{MAF} + 16(p+1)$ bytes. When the average MAF=0.25, this is about $4.5np$ bytes. When MAF=0.045, this is about $0.8np$ bytes, 10-fold smaller than the `Matrix{Float64}` type.  

To be concrete, consider 2 typical data sets:  
* COPD (GWAS): $n = 6670$ individuals, $p = 630998$ SNPs, average MAF is 0.2454.
* GAW19 (sequencing study): $n = 959$ individuals, $p = 8348674$ SNPs, average MAF is 0.085.  

| Data Type | COPD | GAW19 |  
|---|---|---|  
| `SnpArray` | 1.05GB | 2GB |  
| `Matrix{Float64}` | 33.67GB | 64.05GB |  
| `SparseMatrixCSC{Float64,Int64}` | 29GB | 20.82GB |  
| `SparseMatrixCSC{Bool,Int64}` | 18.6GB | 12.386GB |  
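
The COPD column can be reproduced from the formulas listed above. The helper names below (`nnzfrac`, `dense_bytes`, `sparse_bytes`, `gb`) are hypothetical and only restate those formulas. This is not the package's `estimatesize()`, and the snippet assumes current Julia (1.x).

```julia
# Approximate storage in bytes for n persons, p SNPs and average MAF,
# following the formulas listed above (hypothetical helper names).
nnzfrac(maf) = maf * (2 - maf)                       # expected fraction of nonzeros
dense_bytes(n, p, elbytes) = elbytes * n * p
sparse_bytes(n, p, maf, valbytes, idxbytes) =
    (valbytes + idxbytes) * n * p * nnzfrac(maf) + idxbytes * (p + 1)

n, p, maf = 6670, 630998, 0.2454                     # COPD example
gb(x) = round(x / 1e9, digits = 2)                   # 1 GB taken as 1e9 bytes

println("SnpArray         ", gb(0.25 * n * p))
println("Matrix{Float64}  ", gb(dense_bytes(n, p, 8)))
println("Sparse F64/Int64 ", gb(sparse_bytes(n, p, maf, 8, 8)))
```

The fraction of nonzeros, $\text{MAF}(2-\text{MAF})$, is the probability that a genotype carries at least one minor allele under Hardy-Weinberg proportions, which is why sparse storage only pays off when the average MAF is small.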

Evidently, for data sets with a majority of rare variants, converting to sparse matrices saves memory and often brings computational advantages too. In the `SparseMatrixCSC` format, the integer type of the row indices `rowval` and column pointer `colptr` must have a maximum value larger than the number of nonzeros in the matrix. An `InexactError()` encountered during conversion often indicates that the range of the integer type is too small. The utility function `estimatesize()` conveniently estimates memory usage in bytes for the input data type.

In [23]:
# estimated memory usage if convert to Matrix{Float64}
estimatesize(people, snps, mean(maf), Matrix{Float64})

3.6101376e7

In [24]:
# convert to Matrix{Float64}
hapmapf64 = convert(Matrix{Float64}, hapmap)

324x13928 Array{Float64,2}:
 0.0  0.0  2.0  0.0  0.0  0.0  1.0  1.0  …  1.0  1.0  1.0  1.0  0.0  0.0  0.0
 0.0  1.0  1.0  1.0  0.0  0.0  0.0  1.0     0.0  0.0  0.0  0.0  0.0  1.0  0.0
 0.0  0.0  1.0  1.0  1.0  0.0  0.0  2.0     1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  1.0  0.0  0.0  1.0     1.0  1.0  1.0  1.0  0.0  0.0  0.0
 0.0  0.0  1.0  1.0  0.0  0.0  0.0  2.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  2.0  0.0  0.0  0.0  …  2.0  1.0  1.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  2.0  0.0  0.0  2.0     1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  2.0  0.0  0.0  0.0  0.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  1.0  0.0  0.0  0.0  2.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  2.0     1.0  1.0  1.0  1.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  2.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  2.0     1.0  1.0  1.0  1.0  0.0  0.0  0.0
 0.0  0.0  2.0  0.0  1.0  0.0  0.0  

In [25]:
# actual memory usage of Matrix{Float64}
Base.summarysize(hapmapf64)

36101376

In [26]:
# average maf of the hapmap3 data set
mean(maf)

0.222585591341583

In [27]:
# estimated memory usage if convert to SparseMatrixCSC{Float32, UInt32} matrix
estimatesize(people, snps, mean(maf), SparseMatrixCSC{Float32, UInt32})

1.4338389205819245e7

In [28]:
# convert to SparseMatrixCSC{Float32, UInt32} matrix
hapmapf32sp = convert(SparseMatrixCSC{Float32, UInt32}, hapmap)

324x13928 sparse matrix with 1614876 Float32 entries:
	[2    ,     2]  =  1.0
	[6    ,     2]  =  1.0
	[15   ,     2]  =  1.0
	[31   ,     2]  =  1.0
	[33   ,     2]  =  1.0
	[35   ,     2]  =  1.0
	[43   ,     2]  =  1.0
	[44   ,     2]  =  1.0
	[50   ,     2]  =  1.0
	[54   ,     2]  =  1.0
	⋮
	[135  , 13927]  =  1.0
	[148  , 13927]  =  1.0
	[160  , 13927]  =  1.0
	[164  , 13927]  =  2.0
	[167  , 13927]  =  1.0
	[185  , 13927]  =  1.0
	[266  , 13927]  =  1.0
	[280  , 13927]  =  1.0
	[288  , 13927]  =  1.0
	[118  , 13928]  =  2.0
	[231  , 13928]  =  2.0

In [29]:
# actual memory usage if convert to SparseMatrixCSC{Float32, UInt32} matrix
Base.summarysize(hapmapf32sp)

12974764
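
On the earlier remark about the integer type of `rowval` and `colptr`: a quick range check shows why a narrow index type fails. The helper `fits` is hypothetical, and the nonzero count is the one reported for the sparse hapmap3 matrix above. The snippet uses only `Base` on current Julia (1.x).

```julia
# Hypothetical check: can index type T address a matrix with `nnz` stored entries?
# The last column pointer equals nnz + 1, hence the strict inequality.
fits(::Type{T}, nnz::Integer) where {T<:Integer} = nnz < typemax(T)

nnz_hapmap = 1_614_876
println(fits(UInt16, nnz_hapmap))    # false: UInt16 stops at 65535
println(fits(UInt32, nnz_hapmap))    # true

# Forcing the count into too narrow a type raises the error mentioned earlier.
try
    convert(UInt16, nnz_hapmap)
catch err
    println(typeof(err))             # InexactError
end
```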

By default the `convert()` method converts missing genotypes to `NaN`.

In [30]:
# number of missing genotypes
countnz(isnan(hapmap)), countnz(isnan(hapmapf64))

(11894,11894)

One can enforce *crude imputation* by setting the optional argument `impute=true`. Imputation is done by generating two random alleles according to the minor allele frequency. This is not an optimal strategy, and users should impute missing genotypes by more advanced methods.

In [31]:
hapmapf64impute = convert(Matrix{Float64}, hapmap; impute = true)
countnz(isnan(hapmapf64impute))

0
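
The crude imputation described above can be sketched on a plain vector of minor allele counts. `toy_impute!` is a hypothetical function, not the package code. It assumes the allele frequency is estimated from the observed entries and that the two alleles are drawn independently. The code is written for current Julia (1.x).

```julia
using Random

# Hypothetical crude imputation: replace each NaN by the sum of two random alleles.
function toy_impute!(rng::AbstractRNG, x::AbstractVector{Float64})
    observed = filter(!isnan, x)
    freq = isempty(observed) ? 0.0 : sum(observed) / (2 * length(observed))
    for i in eachindex(x)
        if isnan(x[i])
            x[i] = (rand(rng) < freq) + (rand(rng) < freq)   # 0, 1 or 2 copies
        end
    end
    return x
end

x = [0.0, NaN, 2.0, 1.0, NaN]
toy_impute!(MersenneTwister(123), x)
println(any(isnan, x))   # false
```

Because the fill-in values are random, passing a seeded generator is what makes the result repeatable.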

By default `convert()` translates genotypes according to the *additive* SNP model, which essentially counts the number of **minor alleles** (0, 1 or 2) per genotype. Other SNP models are *dominant* and *recessive*, both in terms of the **minor allele**. When `A1` is the minor allele, genotypes are translated to real numbers according to

| Genotype | `SnpArray` | `model=:additive` | `model=:dominant` | `model=:recessive` |    
|---|---|---|---|---|  
| A1,A1 | 00 | 2 | 1 | 1 |  
| A1,A2 | 01 | 1 | 1 | 0 |  
| A2,A2 | 11 | 0 | 0 | 0 |  
| missing | 10 | NaN | NaN | NaN | 

When `A2` is the minor allele, genotypes are translated according to

| Genotype | `SnpArray` | `model=:additive` | `model=:dominant` | `model=:recessive` |    
|---|---|---|---|---|  
| A1,A1 | 00 | 0 | 0 | 0 |  
| A1,A2 | 01 | 1 | 1 | 0 |  
| A2,A2 | 11 | 2 | 1 | 1 |  
| missing | 10 | NaN | NaN | NaN |

In [32]:
[convert(Vector{Float64}, hapmap[1:10, 5]; model = :additive) convert(Vector{Float64}, hapmap[1:10, 5]; model = :dominant) convert(Vector{Float64}, hapmap[1:10, 5]; model = :recessive)]

10x3 Array{Float64,2}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 0.0  0.0  0.0
 2.0  1.0  1.0
 2.0  1.0  1.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
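
The two translation tables can be condensed into a single function. `toy_translate` is a hypothetical name used for illustration only. It follows the tables above and is not the package's `convert()` method. It assumes current Julia (1.x).

```julia
# Hypothetical translation of one genotype to a number under the three models.
# `a1minor` says whether A1 is the minor allele; NaN marks a missing genotype.
function toy_translate(g::Tuple{Bool,Bool}, a1minor::Bool, model::Symbol)
    g == (true, false) && return NaN
    a2 = g[1] + g[2]                   # copies of A2
    minor = a1minor ? 2 - a2 : a2      # copies of the minor allele
    model == :additive  && return Float64(minor)
    model == :dominant  && return Float64(minor >= 1)
    model == :recessive && return Float64(minor == 2)
    throw(ArgumentError("unknown model $model"))
end

# Heterozygote with A2 as the minor allele under each model.
println([toy_translate((false, true), false, m) for m in (:additive, :dominant, :recessive)])
```

The dominant model flags any carrier of the minor allele, while the recessive model flags only homozygotes for it.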

By default `convert()` does **not** center and scale genotypes. Setting the optional arguments `center=true, scale=true` centers genotypes at 2MAF and scales them by $[2 \cdot \text{MAF} \cdot (1 - \text{MAF})]^{-1/2}$. Mono-allelic SNPs (MAF=0) are not scaled.

In [33]:
[convert(Vector{Float64}, hapmap[:, 5]) convert(Vector{Float64}, hapmap[:, 5]; center = true, scale = true)]

324x2 Array{Float64,2}:
 0.0  -1.25702 
 0.0  -1.25702 
 1.0   0.167017
 1.0   0.167017
 0.0  -1.25702 
 2.0   1.59106 
 2.0   1.59106 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 1.0   0.167017
 ⋮             
 2.0   1.59106 
 0.0  -1.25702 
 1.0   0.167017
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 1.0   0.167017
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
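
The centering and scaling rule is easy to reproduce on a vector of additive-model counts. `toy_standardize` is a hypothetical helper that applies the formula stated above for a given MAF. It is a sketch for current Julia (1.x), not the package code.

```julia
# Hypothetical centering and scaling of additive-model counts for one SNP.
function toy_standardize(x::AbstractVector{<:Real}, maf::Real)
    z = float.(x) .- 2 * maf                    # center at 2*MAF
    if 0 < maf < 1
        z ./= sqrt(2 * maf * (1 - maf))         # skipped for mono-allelic SNPs
    end
    return z
end

println(toy_standardize([0, 1, 2, 1], 0.5))
```

With MAF = 0.5 the counts 0, 1, 2 are centered to -1, 0, 1 and then divided by $\sqrt{0.5}$.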

`copy!()` is the in-place version of `convert()`. Applications such as GWAS loop over SNPs and perform a statistical analysis for each SNP. This can be achieved by

In [34]:
g = zeros(people)
for j = 1:snps
    copy!(g, hapmap[:, j]; model = :additive, impute = true)
    # do statistical analysis
end

## Empirical kinship

`grm()` computes the empirical kinship matrix using either the genetic relationship matrix (`method=:GRM`, default) or the method of moments (`method=:MoM`). Missing genotypes are imputed according to minor allele frequencies on the fly.

In [35]:
# GRM using all SNPs
grm(hapmap)

324x324 Array{Float64,2}:
 0.56673    0.0443687  0.0196643  …  0.0624451  0.068939   0.0620598
 0.0443687  0.530712   0.0318809     0.0493016  0.043351   0.0606032
 0.0196643  0.0318809  0.512099      0.0454608  0.0290651  0.0347894
 0.0463286  0.0353972  0.027901      0.0576465  0.0631885  0.0574523
 0.0505602  0.0410582  0.024834      0.0694772  0.0562678  0.0631605
 0.0427703  0.0308963  0.0372747  …  0.0677988  0.0545417  0.0629865
 0.0379952  0.021542   0.0117349     0.0429154  0.0367599  0.0357789
 0.039454   0.0371082  0.021367      0.0555552  0.0526382  0.0636472
 0.0287846  0.0295993  0.0165778     0.0322943  0.0443417  0.0362813
 0.0375132  0.0409033  0.0245705     0.0641036  0.0553565  0.0470998
 0.0462326  0.0438224  0.0229768  …  0.0563888  0.0644813  0.0584756
 0.057857   0.0374183  0.0357086     0.0640191  0.0565152  0.069355 
 0.0344281  0.0424677  0.0257591     0.0550316  0.0674265  0.0612805
 ⋮                                ⋱                                 
 0.06343

In [36]:
# GRM using every other SNP
grm(sub(hapmap, :, 1:2:snps))

324x324 Array{Float64,2}:
 0.555646   0.0414084  0.0262989  …  0.0647587  0.0706874  0.0648252
 0.0414084  0.544936   0.0355526     0.0561237  0.0440001  0.0538899
 0.0262989  0.0355526  0.501723      0.0384994  0.0382852  0.0459045
 0.0432794  0.0444736  0.0250285     0.0494052  0.0590337  0.0542243
 0.0502537  0.0460495  0.02562       0.0645735  0.0545508  0.0626624
 0.0502448  0.0392753  0.0372825  …  0.0740264  0.059921   0.0509181
 0.0378877  0.025382   0.0167928     0.0452522  0.0332789  0.0327589
 0.0459086  0.0369229  0.0253351     0.0569635  0.053617   0.0667119
 0.0256212  0.0226543  0.0188275     0.0318428  0.0417911  0.0343988
 0.0304497  0.0380977  0.0234851     0.0576673  0.0397379  0.048951 
 0.047065   0.04792    0.0199609  …  0.060362   0.0631994  0.0516364
 0.0594915  0.0468341  0.0395426     0.063652   0.0615488  0.0618787
 0.0240586  0.0414677  0.0209235     0.0530332  0.0663219  0.0611431
 ⋮                                ⋱                                 
 0.05814

In [37]:
# MoM using all SNPs
grm(hapmap; method = :MoM)

324x324 Array{Float64,2}:
 0.539332    0.0344978  0.00214116  …  0.0535102  0.0635479  0.0506761
 0.0344978   0.517957   0.0133597      0.0420555  0.0394575  0.0500856
 0.00214116  0.0133597  0.498827       0.0327264  0.0212717  0.0198546
 0.0430002   0.0289475  0.0231611      0.0522112  0.0690981  0.049259 
 0.0448897   0.0333169  0.0154853      0.0655554  0.057171   0.0561082
 0.0320179   0.0210355  0.0271762   …  0.0596509  0.0500856  0.0486685
 0.0241059   0.0108798  0.00355824     0.0314274  0.0309551  0.0241059
 0.0259953   0.0290656  0.00922655     0.0389852  0.0467791  0.0471334
 0.0209174   0.0248144  0.0139501      0.0298923  0.0487866  0.0298923
 0.021626    0.025641   0.0105255      0.053274   0.0506761  0.032136 
 0.036151    0.033435   0.00580194  …  0.0496133  0.0561082  0.0497313
 0.0481962   0.0322541  0.0267038      0.0553997  0.0556358  0.0519751
 0.0315455   0.0420555  0.0249325      0.0589424  0.0712237  0.0579976
 ⋮                                  ⋱              
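
For orientation, a GRM-type kinship estimate can be sketched from a complete matrix of allele counts. The formula used here, $ZZ^T/(2p)$ with $Z$ the centered and scaled genotype matrix and $p$ the number of SNPs, is an assumption about the usual definition. This document does not state the exact normalization used by `grm()`. `toy_grm` is a hypothetical function for current Julia (1.x) and does no imputation.

```julia
using LinearAlgebra

# X: n x p matrix of allele counts (0, 1, 2) without missing values.
function toy_grm(X::AbstractMatrix{<:Real})
    n, p = size(X)
    freq = vec(sum(X, dims = 1)) ./ (2 * n)      # allele frequency per SNP
    keep = findall(f -> 0 < f < 1, freq)         # drop mono-allelic SNPs
    f = freq[keep]'                              # row vector for broadcasting
    Z = (X[:, keep] .- 2 .* f) ./ sqrt.(2 .* f .* (1 .- f))
    return Z * Z' ./ (2 * length(keep))          # assumed normalization
end

X = [0 1 2; 1 1 0; 2 0 1; 1 2 1]
K = toy_grm(X)
println(size(K))        # (4, 4)
```

The result is a symmetric person-by-person matrix. Under this normalization the diagonal entries are around 0.5 for unrelated, non-inbred individuals.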

## Principal component analysis 

Principal component analysis is widely used in genome-wide association studies (GWAS) to adjust for population substructure. `pca(A, pcs)` computes the top `pcs` principal components based on the `SnpArray` `A`. Each SNP is centered at $2\text{MAF}$ and scaled by $[2\text{MAF}(1-\text{MAF})]^{-1/2}$. The output is  

* `pcscore`: top `pcs` eigen-SNPs, or principal scores, in each column  
* `pcloading`: top `pcs` eigen-vectors, or principal loadings, in each column  
* `pcvariance`: top `pcs` eigen-values, or principal variances

Missing genotypes are imputed according to the minor allele frequencies on the fly. This implies that, in the presence of missing genotypes, running the function on the same `SnpArray` twice may produce slightly different answers. For reproducibility, it is good practice to set the random seed before each function call that imputes on the fly.

In [38]:
srand(123) # set seed
pcscore, pcloading, pcvariance = pca(hapmap, 3)

(
324x3 Array{Float64,2}:
 -38.7231  -1.2983     -7.00541  
 -32.6096  -1.21052    -3.3232   
 -23.0215  -0.505397   12.1751   
 -35.692   -2.76103    -2.40055  
 -37.1815  -0.132498   -3.66829  
 -34.9285  -1.11368     6.14167  
 -22.0323  -5.70536     2.02968  
 -30.9994  -2.28269    -0.0893283
 -22.8432  -3.76024     7.97486  
 -32.2024  -0.239253    2.91168  
 -36.344   -0.773184   -5.31525  
 -35.886   -0.807234    0.279053 
 -33.9423  -3.78982     7.35677  
   ⋮                             
 -49.1282   0.913683   10.4061   
 -46.9862  -0.9654     -0.435579 
 -48.5334  -1.05076    -0.15223  
 -49.0331   0.379279    5.65431  
 -47.8714  -0.406195   -7.14605  
 -48.2028  -1.41369    -0.564107 
 -46.7128  -3.36643    -4.44341  
 -48.9006  -1.69293     0.0467995
 -48.5574   1.34936    -1.89814  
 -50.2291   0.0865293  -1.94494  
 -48.9263  -2.06102     2.17374  
 -48.8627   0.274894    6.49518  ,

13928x3 Array{Float64,2}:
  9.66817e-20   7.35949e-19   5.79015e-19
  0.00143962   -0.00

To use eigen-SNPs for plotting or covariates in GWAS, we typically scale them by their standard deviations so that they have mean zero and unit variance.

In [39]:
# standardize eigen-SNPs before plotting or GWAS
scale!(pcscore, 1.0 ./ √(pcvariance))
std(pcscore, 1)

1x3 Array{Float64,2}:
 1.0  1.0  1.0
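
The relation between scores, loadings and variances, and why dividing the scores by the square roots of the variances gives unit standard deviation, can be illustrated with a singular value decomposition. This is a sketch under assumptions: `toy_pca` is a hypothetical function, the input is an already centered numeric matrix, and the variances use an $n-1$ denominator. It is not the package's `pca()` and targets current Julia (1.x).

```julia
using LinearAlgebra, Random, Statistics

# Z: n x p matrix whose columns are centered (and, for genotypes, scaled).
function toy_pca(Z::AbstractMatrix{<:Real}, pcs::Integer)
    n = size(Z, 1)
    F = svd(Z)                                    # Z = U * Diagonal(S) * V'
    pcscore = F.U[:, 1:pcs] .* F.S[1:pcs]'        # principal scores
    pcloading = F.V[:, 1:pcs]                     # principal loadings
    pcvariance = F.S[1:pcs] .^ 2 ./ (n - 1)       # principal variances
    return pcscore, pcloading, pcvariance
end

rng = MersenneTwister(123)
Z = randn(rng, 20, 5)
Z .-= mean(Z, dims = 1)                           # center each column
score, loading, variance = toy_pca(Z, 2)
standardized = score ./ sqrt.(variance)'
println(std(standardized, dims = 1))              # each column has unit std
```

Since the columns of `Z` are centered, every score column has mean zero, so its sample variance equals the corresponding principal variance.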

Internally `pca()` converts `SnpArray` to the matrix of minor allele counts. The default format is `Matrix{Float64}`, which can easily exceed the memory limit. Users have several options when the default `Matrix{Float64}` cannot fit into memory.  

* Use other intermediate matrix types.

In [40]:
# use single precision matrix and display the principal variances
# approximately same answer as double precision
srand(123)
pca(hapmap, 3, Matrix{Float32})[3]

3-element Array{Float32,1}:
 1841.39  
  225.324 
   70.7085

* Use a subset of SNPs.

In [41]:
# principal components using every other SNP capture about half the variance
srand(123)
pca(sub(hapmap, :, 1:2:snps), 3)[3]

3-element Array{Float64,1}:
 926.622 
 113.188 
  36.4866

* Use a sparse matrix. For large data sets with a majority of rare variants, `pca_sp()` is more efficient: it first converts the `SnpArray` to a sparse matrix (default is `SparseMatrixCSC{Float64, Int64}`) and then computes principal components using iterative algorithms. 

In [42]:
# approximately same answer if we use Float16 sparse matrix
srand(123)
pca_sp(hapmap, 3, SparseMatrixCSC{Float16, UInt32})[3]

3-element Array{Float64,1}:
 1841.4   
  225.31  
   70.7094

In [43]:
# approximately same answer if we use Int8 sparse matrix
srand(123)
pca_sp(hapmap, 3, SparseMatrixCSC{Int8, UInt32})[3]

3-element Array{Float64,1}:
 1841.4   
  225.328 
   70.7119

In [44]:
versioninfo()

Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
