# SnpArrays.jl

The module `SnpArrays` implements the `SnpArray` type for handling biallelic genotypes. `SnpArray` is an array of `Tuple{Bool,Bool}` and adopts the same coding as the [Plink binary format](http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml). If `A1` and `A2` are two alleles, the coding rule is  

| Genotype | SnpArray |
|---|---|
| A1,A1 | 00 |
| A1,A2 | 01 |
| A2,A2 | 11 |
| missing | 10 |

The code `10==(true,false)` is reserved for a missing genotype. Otherwise the bit `1==true` represents one copy of allele `A2`. In a two-dimensional `SnpArray`, each row is a person and each column is a SNP.
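
As a small illustration of the coding rule, the sketch below maps a genotype tuple to its number of A2 copies in plain Julia. The helper name `a2count` is hypothetical and is not part of `SnpArrays`; it only restates the table above.

```julia
# Hypothetical helper (not a SnpArrays function): number of A2 copies
# in a genotype stored as a pair of bits.
function a2count(g::Tuple{Bool,Bool})
    # (true, false) is the code reserved for a missing genotype
    g == (true, false) && return NaN
    # otherwise every `true` bit stands for one copy of allele A2
    return Float64(g[1] + g[2])
end

a2count((false, false))  # A1,A1 -> 0.0
a2count((false, true))   # A1,A2 -> 1.0
a2count((true, true))    # A2,A2 -> 2.0
```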

## Constructor

There are various ways to initialize a `SnpArray`.  

* `SnpArray` can be initialized from [Plink binary files](http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml), for example the `hapmap3` data set:

In [1]:
;ls -al hapmap3.*

-rw-r--r--@ 1 hzhou3  staff  1128171 Jun  4  2010 hapmap3.bed
-rw-r--r--@ 1 hzhou3  staff   388672 Jun  4  2010 hapmap3.bim
-rw-r--r--@ 1 hzhou3  staff     7136 Jun  4  2010 hapmap3.fam
-rw-r--r--@ 1 hzhou3  staff   332960 Jun  4  2010 hapmap3.map


In [2]:
include("../src/SnpArrays.jl")
using SnpArrays
hapmap = SnpArray("hapmap3")

324x13928 SnpArrays.SnpArray{2}:
 (true,true)  (true,true)   (false,false)  …  (true,true)   (true,true)
 (true,true)  (false,true)  (false,true)      (false,true)  (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (false,true)  (true,true)    …  (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)   …  (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 ⋮                             

In [3]:
# rows are persons; columns are SNPs
people, snps = size(hapmap)

(324,13928)

Internally `SnpArray` stores data as `BitArray`s, and its memory consumption is about the same as the Plink `bed` file size.

In [4]:
# memory usage
Base.summarysize(hapmap)

1128256
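
The agreement with the `bed` file size can be checked by hand. A SNP-major Plink `bed` file has a 3-byte header followed, for each SNP, by the genotypes packed four to a byte (2 bits each). The function name `bedsize` below is only illustrative.

```julia
# Expected size in bytes of a SNP-major Plink .bed file:
# 3 header bytes, then ceil(people / 4) bytes for every SNP.
bedsize(people, snps) = 3 + cld(people, 4) * snps

bedsize(324, 13928)  # 1128171, the size of hapmap3.bed listed earlier
```

The two-bit storage alone accounts for 324 * 13928 / 4 = 1128168 bytes, so the value reported by `Base.summarysize` is this payload plus a small amount of container overhead.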

* `SnpArray` can be initialized from a matrix of A1 allele counts.

In [5]:
SnpArray(rand(0:2, 5, 3))

5x3 SnpArrays.SnpArray{2}:
 (false,false)  (false,false)  (true,true)  
 (false,true)   (false,true)   (false,false)
 (true,true)    (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)
 (false,false)  (true,true)    (false,true) 

* `SnpArray(m, n)` generates an m by n `SnpArray` of all A1 alleles.

In [6]:
s = SnpArray(5, 3)

5x3 SnpArrays.SnpArray{2}:
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)

## Summary statistics

`summarize()` computes  
* `maf`: minor allele frequencies, taking missingness into account.  
* `minor_allele`: a `BitVector` indicating the minor allele for each SNP. `minor_allele[j]==true` means A1 is the minor allele for SNP j; `minor_allele[j]==false` means A2 is the minor allele for SNP j.  
* `missings_by_snp`: number of missing genotypes for each SNP.  
* `missings_by_person`: number of missing genotypes for each person.

In [7]:
maf, minor_allele, missings_by_snp, missings_by_person = summarize(hapmap)
# minor allele frequencies
maf'

1x13928 Array{Float64,2}:
 0.0  0.0776398  0.324074  0.191589  …  0.00154321  0.0417957  0.00617284

In [8]:
# total number of missing genotypes
sum(missings_by_snp), sum(missings_by_person)

(11894,11894)
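
To make "taking missingness into account" concrete, here is a minimal sketch for a single SNP in plain Julia. It assumes the genotypes are given as A1 allele counts with `NaN` marking a missing value, and it assigns a tie at frequency 0.5 to A1; the name `maf_sketch` and both conventions are assumptions, not the package's implementation.

```julia
# Hypothetical sketch: minor allele frequency of one SNP from A1 allele
# counts (0, 1 or 2), where NaN marks a missing genotype.
function maf_sketch(a1counts::AbstractVector{<:Real})
    observed = filter(!isnan, a1counts)        # drop missing genotypes
    isempty(observed) && return (0.0, false)
    # each observed person carries two alleles
    freq = sum(observed) / (2 * length(observed))
    minor_is_a1 = freq <= 0.5                  # tie goes to A1 (assumption)
    return (minor_is_a1 ? freq : 1 - freq, minor_is_a1)
end

maf_sketch([2.0, 1.0, 0.0, NaN, 0.0])  # (0.375, true): 3 A1 alleles out of 8
```

The missing genotype is excluded from both the numerator and the denominator, so it does not bias the frequency toward either allele.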

## Random genotypes

`randgeno(a1freq)` generates a random genotype according to A1 allele frequency `a1freq`.

In [9]:
randgeno(0.5)

(false,false)

`randgeno(maf, minor_allele)` generates a random genotype according to minor allele frequency `maf` and whether the minor allele is A1 (`minor_allele==true`) or A2 (`minor_allele==false`).

In [10]:
randgeno(0.25, true)

(true,true)

`randgeno(n, maf, minor_allele)` generates a vector of random genotypes according to a common minor allele frequency `maf` and the minor allele.

In [11]:
randgeno(10, 0.25, true)

10-element SnpArrays.SnpArray{1}:
 (false,true)
 (false,true)
 (true,true) 
 (true,true) 
 (true,true) 
 (false,true)
 (true,true) 
 (true,true) 
 (true,true) 
 (true,true) 

`randgeno(m, n, maf, minor_allele)` generates a random $m$-by-$n$ `SnpArray` according to a vector of minor allele frequencies `maf` and a minor allele indicator vector. Lengths of both vectors should be $n$.

In [12]:
# this is a random replicate of the hapmap data
randgeno(size(hapmap), maf, minor_allele)

324x13928 SnpArrays.SnpArray{2}:
 (true,true)  (true,true)   (true,true)    …  (true,true)   (true,true) 
 (true,true)  (false,true)  (false,true)      (true,true)   (true,true) 
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true) 
 (true,true)  (false,true)  (false,true)      (true,true)   (true,true) 
 (true,true)  (false,true)  (false,true)      (true,true)   (true,true) 
 (true,true)  (true,true)   (true,true)    …  (true,true)   (true,true) 
 (true,true)  (false,true)  (true,true)       (true,true)   (true,true) 
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true) 
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true) 
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true) 
 (true,true)  (true,true)   (false,true)   …  (true,true)   (true,true) 
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true) 
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true) 
 ⋮                
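
One way to draw such a genotype is to sample its two alleles independently, which amounts to assuming Hardy-Weinberg proportions. The sketch below does this in plain Julia; `randgeno_sketch` is a hypothetical name and is not claimed to match how `randgeno()` is implemented.

```julia
using Random

# Hypothetical sketch (not the package's randgeno): draw one genotype by
# sampling two alleles independently, each being A1 with probability a1freq.
function randgeno_sketch(rng::AbstractRNG, a1freq::Real)
    b1 = rand(rng) >= a1freq   # true means this allele is A2
    b2 = rand(rng) >= a1freq
    # Order the bits so a heterozygote is (false, true); the pair
    # (true, false) is reserved for missing and is never produced.
    return (b1 & b2, b1 | b2)
end

rng = MersenneTwister(2016)
[randgeno_sketch(rng, 0.5) for _ in 1:5]
```

Passing the random number generator explicitly keeps the draws reproducible without touching the global seed.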

## Subsetting

Subsetting a `SnpArray` works the same way as subsetting any other array.

In [13]:
# genotypes of the 1st individual
hapmap[1, :]

1x13928 SnpArrays.SnpArray{2}:
 (true,true)  (true,true)  (false,false)  …  (true,true)  (true,true)

In [14]:
# genotypes of the 5th SNP
hapmap[:, 5]

324-element SnpArrays.SnpArray{1}:
 (true,true)  
 (true,true)  
 (false,true) 
 (false,true) 
 (true,true)  
 (false,false)
 (false,false)
 (true,true)  
 (true,true)  
 (true,true)  
 (true,true)  
 (true,true)  
 (false,true) 
 ⋮            
 (false,false)
 (true,true)  
 (false,true) 
 (true,true)  
 (true,true)  
 (true,true)  
 (true,true)  
 (true,true)  
 (false,true) 
 (true,true)  
 (true,true)  
 (true,true)  

In [15]:
# subsetting both individuals and SNPs
hapmap[1:5, 5:10]

5x6 SnpArrays.SnpArray{2}:
 (true,true)   (true,true)  (false,true)  …  (true,true)   (false,true)
 (true,true)   (true,true)  (true,true)      (true,true)   (false,true)
 (false,true)  (true,true)  (true,true)      (false,true)  (true,true) 
 (false,true)  (true,true)  (true,true)      (true,true)   (false,true)
 (true,true)   (true,true)  (true,true)      (true,true)   (false,true)

In [16]:
# filter out rare SNPs with MAF < 0.05
hapmap[:, maf .≥ 0.05]

324x12085 SnpArrays.SnpArray{2}:
 (true,true)   (false,false)  (true,true)   …  (false,true)  (false,true)
 (false,true)  (false,true)   (false,true)     (true,true)   (true,true) 
 (true,true)   (false,true)   (false,true)     (true,true)   (true,true) 
 (true,true)   (false,true)   (true,true)      (false,true)  (false,true)
 (true,true)   (false,true)   (false,true)     (true,true)   (true,true) 
 (false,true)  (true,true)    (true,true)   …  (false,true)  (false,true)
 (true,true)   (true,true)    (true,true)      (true,true)   (true,true) 
 (true,true)   (false,false)  (true,true)      (true,true)   (true,true) 
 (true,true)   (false,true)   (false,true)     (true,true)   (true,true) 
 (true,true)   (false,true)   (true,true)      (false,true)  (false,true)
 (true,true)   (false,true)   (true,true)   …  (true,true)   (true,true) 
 (true,true)   (true,true)    (false,true)     (false,true)  (false,true)
 (true,true)   (false,false)  (true,true)      (false,true)  (false,true)
 ⋮   

In [17]:
# filter out individuals with genotyping success rate < 0.90
hapmap[missings_by_person / people .< 0.1, :]

220x13928 SnpArrays.SnpArray{2}:
 (true,true)  (true,true)   (false,false)  …  (true,true)   (true,true)
 (true,true)  (false,true)  (false,true)      (false,true)  (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)   …  (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (false,true)  (false,true)   …  (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 ⋮                             
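
The filtering logic can be sketched on toy data in plain Julia. Here the genotyping success rate of a person is taken to be the share of SNPs with a called genotype, so the missing count is divided by the number of SNPs; the counts and variable names below are made up for illustration.

```julia
# Toy data: missing genotype counts for 4 people typed at 100 SNPs
missings = [0, 3, 12, 1]
nsnps = 100

# success rate per person = 1 - (missing genotypes / number of SNPs)
success = 1 .- missings ./ nsnps
keep = success .>= 0.90          # logical row selector

G = zeros(Int8, 4, nsnps)        # stand-in genotype matrix
size(G[keep, :])                 # (3, 100): the third person is dropped
```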

`sub()` and `slice()` create views of a subarray without copying data, which improves efficiency in many calculations.

In [18]:
mafcommon, = summarize(sub(hapmap, :, maf .≥ 0.05))
mafcommon'

1x12085 Array{Float64,2}:
 0.0776398  0.324074  0.191589  …  0.310937  0.23913  0.23913  0.23913
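
The difference between a view and a copy can be seen on an ordinary matrix. The sketch below uses `view()`, which superseded `sub()` and `slice()` in later Julia releases, so it targets current Julia rather than the session shown here.

```julia
A = collect(reshape(1:12, 3, 4))      # ordinary 3x4 integer matrix
cols = [true, false, true, false]     # logical column selector

v = view(A, :, cols)                  # a view: no data is copied
c = A[:, cols]                        # plain indexing makes a copy

v[1, 1] = 100                         # writes through to A
A[1, 1]                               # 100
c[1, 1]                               # still 1
```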

## Assignment

It is possible to assign specific genotypes to a `SnpArray` entry.

In [19]:
hapmap[1, 1]

(true,true)

In [20]:
hapmap[1, 1] = (false, true)
hapmap[1, 1]

(false,true)

In [21]:
hapmap[1, 1] = NaN
hapmap[1, 1]

(true,false)

In [22]:
hapmap[1, 1] = 2
hapmap[1, 1]

(true,true)

Subsetted assignment such as `hapmap[:, 1] = NaN` is also valid.

## Copy, convert, and imputation

In most analyses we convert the whole `SnpArray`, or slices of it, to numeric arrays for computation. Keep in mind that the resulting data can take up to 32 times the storage of the `SnpArray`. Fortunately, the rich collection of data types in `Julia` lets us choose one that fits into memory. Below are estimates of memory usage for some common data types with $n$ individuals and $p$ SNPs. Here MAF denotes the **average** minor allele frequency.

* `SnpArray`: $0.25np$ bytes  
* `Matrix{Int8}`: $np$ bytes  
* `Matrix{Float16}`: $2np$ bytes  
* `Matrix{Float32}`: $4np$ bytes  
* `Matrix{Float64}`: $8np$ bytes  
* `SparseMatrixCSC{Float64,Int64}`: $16 \cdot \text{NNZ} + 8(p+1) \approx 16np(2\text{MAF}(1-\text{MAF})+\text{MAF}^2) + 8(p+1) = 16np \cdot \text{MAF}(2-\text{MAF}) + 8(p+1)$ bytes. When average MAF=0.25, it is about $7np$ bytes. When MAF=0.025, it is about $0.8np$ bytes, 10-fold smaller than the `Matrix{Float64}` type.  
* `SparseMatrixCSC{Int8,UInt32}`: $5 \cdot \text{NNZ} + 4(p+1) \approx 5np(2\text{MAF}(1-\text{MAF})+\text{MAF}^2) + 4(p+1) = 5np \cdot \text{MAF}(2-\text{MAF}) + 4(p+1)$ bytes. When average MAF=0.25, it is about $2.2np$ bytes. When MAF=0.08, it is about $0.8np$ bytes, 10-fold smaller than the `Matrix{Float64}` type.  
* Two `SparseMatrixCSC{Bool,Int64}`: $2np \cdot \text{MAF} \cdot 9 + 16(p+1) = 18 np \cdot \text{MAF} + 16(p+1)$ bytes. When average MAF=0.25, it is about $4.5np$ bytes. When MAF=0.045, it is about $0.8np$ bytes, 10-fold smaller than the `Matrix{Float64}` type.  

To be concrete, consider 2 typical data sets:  
* COPD (GWAS): $n = 6670$ individuals, $p = 630998$ SNPs, average MAF is 0.2454.
* GAW19 (sequencing study): $n = 1943$ individuals, $p = 1711766$ SNPs, average MAF is 0.00499.  

| Data Type | COPD | GAW19 |  
|---|---|---|  
| `SnpArray` | 1.05GB | 0.83GB |  
| `Matrix{Float64}` | 33.67GB | 26.61GB |  
| `SparseMatrixCSC{Float64,Int64}` | 29GB | 0.543GB |  
| Two `SparseMatrixCSC{Bool,Int64}` | 18.6GB | 0.326GB |  
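
The table entries follow directly from the formulas listed earlier. The sketch below evaluates them in plain Julia; the function names are made up for this illustration and are unrelated to the package's `estimatesize()`.

```julia
# Expected number of nonzero minor allele counts for n people and p SNPs
nnz_est(n, p, maf) = n * p * maf * (2 - maf)

# Memory estimates in bytes, one function per row of the table
snparray_bytes(n, p) = 0.25 * n * p
dense_f64_bytes(n, p) = 8 * n * p
sparse_f64_bytes(n, p, maf) = 16 * nnz_est(n, p, maf) + 8 * (p + 1)
two_bool_bytes(n, p, maf) = 18 * n * p * maf + 16 * (p + 1)

# COPD: n = 6670, p = 630998, average MAF = 0.2454
sparse_f64_bytes(6670, 630998, 0.2454) / 1e9     # about 29 GB

# GAW19: n = 1943, p = 1711766, average MAF = 0.00499
sparse_f64_bytes(1943, 1711766, 0.00499) / 1e9   # about 0.543 GB
```

For GAW19 the column pointer term $8(p+1)$ is no longer negligible: it contributes roughly 0.014GB of the 0.543GB total.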

Clearly, for data sets made up mostly of rare variants, converting to sparse matrices saves memory and often brings computational advantages too. When choosing the integer type of the row indices `rowval` and column pointer `colptr` in the `SparseMatrixCSC` format, make sure its maximum is larger than the number of nonzeros in the matrix. An `InexactError()` encountered during conversion often indicates that the range of the integer type is too small. The utility function `estimatesize()` conveniently estimates memory usage in bytes from the input data type.

In [23]:
# estimated memory usage if convert to Matrix{Float64}
estimatesize(people, snps, mean(maf), Matrix{Float64})

3.6101376e7

In [24]:
# actual memory usage if convert to Float64
hapmapf64 = convert(Matrix{Float64}, hapmap)
Base.summarysize(hapmapf64)

36101376

In [25]:
# average maf of the hapmap3 data set
mean(maf)

0.222585591341583

In [26]:
# estimated memory usage if convert to SparseMatrixCSC{Float32, UInt32} matrix
estimatesize(people, snps, mean(maf), SparseMatrixCSC{Float32, UInt32})

1.4338389205819245e7

In [27]:
# actual memory usage if convert to SparseMatrixCSC{Float32, UInt32} matrix
hapmapf32sp = convert(SparseMatrixCSC{Float32, UInt32}, hapmap)
Base.summarysize(hapmapf32sp)

12974764
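
Regarding the choice of index type for `SparseMatrixCSC`, a quick range check can be done before converting. The sketch below is plain Julia; the helper name `fits` is hypothetical.

```julia
# Hypothetical helper: can integer type T hold a count of k nonzeros?
fits(k::Integer, ::Type{T}) where {T<:Integer} = k <= typemax(T)

fits(70_000, UInt16)   # false: typemax(UInt16) is 65535
fits(70_000, UInt32)   # true

# Converting an out-of-range value raises an InexactError, the same kind
# of error that signals a too-narrow index type during conversion.
err = try
    convert(UInt16, 70_000)
catch e
    e
end
err isa InexactError   # true
```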

By default, `convert()` does **not** impute missing genotypes; it converts them to `NaN`.

In [28]:
# number of missing genotypes
countnz(isnan(hapmap)), countnz(isnan(hapmapf64))

(11894,11894)

We can enforce imputation by setting the optional argument `impute=true`. Imputation is done by generating two random alleles according to the minor allele frequency. This is not an optimal strategy, and users should make sure genotypes are imputed with high quality using other, more advanced methods.

In [29]:
hapmapf64impute = convert(Matrix{Float64}, hapmap; impute = true)
countnz(isnan(hapmapf64impute))

0

By default, `convert()` translates genotypes according to the *additive* SNP model, which is essentially the **minor allele** count (0, 1 or 2). The other SNP models are *dominant* and *recessive*, both in terms of the **minor allele**. When `A1` is the minor allele, a genotype is translated to a real number according to

| Genotype | `SnpArray` | `model=:additive` | `model=:dominant` | `model=:recessive` |    
|---|---|---|---|---|  
| A1,A1 | 00 | 2 | 1 | 1 |  
| A1,A2 | 01 | 1 | 1 | 0 |  
| A2,A2 | 11 | 0 | 0 | 0 |  
| missing | 10 | NaN | NaN | NaN | 

When `A2` is the minor allele, genotype is translated according to

| Genotype | `SnpArray` | `model=:additive` | `model=:dominant` | `model=:recessive` |    
|---|---|---|---|---|  
| A1,A1 | 00 | 0 | 0 | 0 |  
| A1,A2 | 01 | 1 | 1 | 0 |  
| A2,A2 | 11 | 2 | 1 | 1 |  
| missing | 10 | NaN | NaN | NaN |
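
Both tables reduce to one rule: count the copies of the minor allele, then apply the model. The sketch below encodes that rule in plain Julia; `translate` is a hypothetical name and not a function of the package.

```julia
# Hypothetical sketch of the two translation tables above.
function translate(g::Tuple{Bool,Bool}, minor_is_a1::Bool, model::Symbol)
    g == (true, false) && return NaN         # missing genotype
    a2 = g[1] + g[2]                         # copies of allele A2
    m = minor_is_a1 ? 2 - a2 : a2            # copies of the minor allele
    model == :additive  && return Float64(m)
    model == :dominant  && return Float64(m >= 1)
    model == :recessive && return Float64(m == 2)
    throw(ArgumentError("unknown model: $model"))
end

translate((false, false), true, :additive)   # A1,A1 with A1 minor -> 2.0
translate((false, true), false, :recessive)  # heterozygote -> 0.0
```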

In [30]:
[convert(Vector{Float64}, hapmap[1:10, 5]; model = :additive) convert(Vector{Float64}, hapmap[1:10, 5]; model = :dominant) convert(Vector{Float64}, hapmap[1:10, 5]; model = :recessive)]

10x3 Array{Float64,2}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 0.0  0.0  0.0
 2.0  1.0  1.0
 2.0  1.0  1.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

By default `convert()` does **not** center and scale genotypes. Setting optional arguments `center=true, scale=true` centers genotypes at 2MAF and scales them by $[2 \cdot \text{MAF} \cdot (1 - \text{MAF})]^{-1/2}$. Mono-allelic SNPs (MAF=0) are not scaled.

In [31]:
[convert(Vector{Float64}, hapmap[:, 5]) convert(Vector{Float64}, hapmap[:, 5]; center = true, scale = true)]

324x2 Array{Float64,2}:
 0.0  -1.25702 
 0.0  -1.25702 
 1.0   0.167017
 1.0   0.167017
 0.0  -1.25702 
 2.0   1.59106 
 2.0   1.59106 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 1.0   0.167017
 ⋮             
 2.0   1.59106 
 0.0  -1.25702 
 1.0   0.167017
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 1.0   0.167017
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
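
The centering and scaling rule can be written out for a single minor allele count. This is a minimal sketch in plain Julia with a hypothetical function name; it follows the formula stated above and leaves mono-allelic SNPs unscaled.

```julia
# Hypothetical sketch: center a minor allele count g at 2*MAF and scale it
# by [2*MAF*(1-MAF)]^(-1/2); SNPs with MAF = 0 are centered but not scaled.
function standardize(g::Real, maf::Real)
    centered = g - 2 * maf
    maf == 0 && return float(centered)
    return centered / sqrt(2 * maf * (1 - maf))
end

standardize(2.0, 0.25)   # (2 - 0.5) / sqrt(0.375), about 2.449
```

Because the three possible counts 0, 1 and 2 are equally spaced, the standardized values are equally spaced too, with spacing equal to the scale factor. The output above shows this pattern: 0.167017 - (-1.25702) and 1.59106 - 0.167017 are both about 1.424.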

`copy!()` is the in-place version of `convert()`. Applications such as GWAS loop over SNPs and perform a statistical analysis for each SNP. This can be achieved by

In [32]:
g = zeros(people)
for j = 1:snps
    copy!(g, hapmap[:, j]; model = :additive, impute = true)
    # do statistical analysis
end

## Empirical kinship

`grm()` computes the empirical kinship matrix using either the genetic relationship matrix (`method=:GRM`, default) or the method of moments (`method=:MoM`). Missing genotypes are imputed on the fly according to minor allele frequencies.

In [33]:
# GRM using all SNPs
grm(hapmap)

324x324 Array{Float64,2}:
 0.566416   0.0442463  0.018849   …  0.0625497  0.0686532  0.0623838
 0.0442463  0.530606   0.0311836     0.0492683  0.0433574  0.0605358
 0.018849   0.0311836  0.51111       0.0443478  0.029146   0.0345664
 0.0463389  0.0356874  0.0275817     0.0576113  0.0632411  0.0575781
 0.0503302  0.0410834  0.0242129     0.0688273  0.0560492  0.0634781
 0.042147   0.0307217  0.0364152  …  0.0677623  0.0541995  0.0624585
 0.0378372  0.021092   0.0111269     0.0426436  0.0364779  0.0355157
 0.039818   0.0370717  0.0204989     0.0553709  0.0528329  0.0640746
 0.02878    0.0302664  0.0161618     0.0323108  0.0446895  0.0362039
 0.0377034  0.0407329  0.0253116     0.0641344  0.0551614  0.0469087
 0.04632    0.0436639  0.0225126  …  0.0564845  0.0644777  0.0589484
 0.0579301  0.0373691  0.0353352     0.0632216  0.0561654  0.0696915
 0.0345738  0.0421564  0.025246      0.0548924  0.0675433  0.0609996
 ⋮                                ⋱                                 

In [34]:
# GRM using every other SNP
grm(sub(hapmap, :, 1:2:snps))

324x324 Array{Float64,2}:
 0.555711   0.0413792  0.0268101  …  0.0649168  0.0712169  0.0658343
 0.0413792  0.545397   0.0354551     0.0560558  0.0441178  0.0537197
 0.0268101  0.0354551  0.501378      0.0379049  0.0373067  0.0451489
 0.0431157  0.0441028  0.0246234     0.0492876  0.0595074  0.0541423
 0.0501599  0.0462128  0.0245604     0.0645959  0.0548739  0.0630694
 0.0509204  0.0394032  0.0386632  …  0.0744758  0.0599421  0.0503997
 0.0380842  0.0257835  0.0157169     0.0452693  0.0331344  0.0330554
 0.0456764  0.0366144  0.0248477     0.0569215  0.0538854  0.0664857
 0.0256848  0.0224182  0.0180971     0.0319316  0.0416416  0.0345233
 0.0302425  0.0382948  0.0235541     0.057457   0.0400379  0.0488062
 0.0470478  0.0488069  0.0198282  …  0.0605351  0.0626474  0.0510378
 0.0599984  0.0477784  0.0393928     0.0635521  0.0625181  0.0624833
 0.0241962  0.0415118  0.0208591     0.0524789  0.0659913  0.0609764
 ⋮                                ⋱                                 

In [35]:
# MoM using all SNPs
grm(hapmap; method = :MoM)

324x324 Array{Float64,2}:
 0.539332    0.0353244  0.00225925  …  0.0541007  0.063784   0.0506761
 0.0353244   0.518076   0.0143044      0.0419374  0.0393394  0.0503218
 0.00225925  0.0143044  0.498945       0.0318998  0.0209174  0.0193823
 0.0431183   0.0291837  0.0233973      0.0522112  0.0690981  0.0496133
 0.0448897   0.033553   0.0146587      0.0649649  0.0562263  0.0559901
 0.0322541   0.0210355  0.0259953   …  0.0594147  0.0503218  0.049259 
 0.0248144   0.0112341  0.0028497      0.0316636  0.030837   0.0244601
 0.0261134   0.0287114  0.00958082     0.0388671  0.0463067  0.0468972
 0.0211536   0.0259953  0.0134778      0.0300104  0.0487866  0.0301284
 0.0219802   0.0261134  0.011116       0.053274   0.0509122  0.032136 
 0.0359148   0.0333169  0.00556576  …  0.0493771  0.0559901  0.0498494
 0.0487866   0.0324902  0.0267038      0.0562263  0.0556358  0.0520931
 0.0309551   0.0427641  0.0252868      0.0591785  0.070279   0.0576434
 ⋮                                  ⋱              
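
For intuition, a genetic relationship matrix can be sketched on a small dense matrix of allele counts. The code assumes the commonly used form $\Phi_{ij} = \frac{1}{2p} \sum_{k} \frac{(g_{ik} - 2f_k)(g_{jk} - 2f_k)}{2 f_k (1 - f_k)}$, where $f_k$ is the frequency of the counted allele at SNP $k$. It assumes no missing genotypes and skips SNPs with no variation. Whether `grm()` uses exactly this normalization is an assumption, and `grm_sketch` is a hypothetical name.

```julia
using LinearAlgebra

# Hypothetical sketch of a GRM-type kinship estimate from a complete
# people-by-SNPs matrix of allele counts (0, 1 or 2).
function grm_sketch(G::AbstractMatrix{<:Real})
    n, p = size(G)
    K = zeros(n, n)
    used = 0                                   # SNPs that actually vary
    for j in 1:p
        g = G[:, j]
        freq = sum(g) / (2 * n)                # frequency of the counted allele
        (freq == 0 || freq == 1) && continue   # skip mono-allelic SNPs
        z = (g .- 2 * freq) ./ sqrt(2 * freq * (1 - freq))
        K .+= z * z'                           # rank-one update for this SNP
        used += 1
    end
    return K ./ (2 * used)
end

G = [0 2; 1 1; 2 0]    # 3 people, 2 SNPs
grm_sketch(G)          # approximately [1 0 -1; 0 0 0; -1 0 1]
```

The factor 2 in the denominator puts the result on the kinship scale, where a non-inbred person has self-kinship 0.5. That matches the diagonal values near 0.5 in the outputs above.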

## Principal component analysis 

Principal component analysis is widely used in genome-wide association studies (GWAS) to adjust for population substructure. `pca(A, pcs)` computes the top `pcs` principal components based on SNP data `A`. Each SNP is centered at $2\text{MAF}$ and scaled by $[2\text{MAF}(1-\text{MAF})]^{-1/2}$. The output is  

* `pcscore`: top `pcs` eigen-SNPs, or principal scores, in each column  
* `pcloading`: top `pcs` eigen-vectors, or principal loadings, in each column  
* `pcvariance`: top `pcs` eigen-values, or principal variances
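
How the three outputs relate to each other can be sketched with a singular value decomposition of an already standardized, dense matrix. This is not the package's algorithm: the name `pca_sketch`, the use of `svd`, and the division by $n-1$ are assumptions made for illustration.

```julia
using LinearAlgebra

# Hypothetical sketch: top `pcs` principal components of a column-centered
# people-by-SNPs matrix X via the SVD X = U * Diagonal(S) * V'.
function pca_sketch(X::AbstractMatrix{<:Real}, pcs::Integer)
    n = size(X, 1)
    F = svd(X)
    loading = F.V[:, 1:pcs]                   # principal loadings (per SNP)
    score = X * loading                       # principal scores (per person)
    variance = (F.S[1:pcs] .^ 2) ./ (n - 1)   # principal variances
    return score, loading, variance
end

X = [1.0 2.0; -1.0 0.0; 0.0 -2.0]             # both columns have mean zero
score, loading, variance = pca_sketch(X, 2)
```

With this convention, dividing each score column by the square root of its principal variance gives it unit sample variance.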



Missing genotypes are imputed on the fly according to minor allele frequencies. This means that, in the presence of missing genotypes, running the function on the same `SnpArray` twice may produce slightly different answers. For reproducibility, it is good practice to set the random seed before each function that imputes on the fly.

In [36]:
srand(123) # set seed
pcscore, pcloading, pcvariance = pca(hapmap, 3)

(
324x3 Array{Float64,2}:
 -38.7231  -1.2983     -7.00541  
 -32.6096  -1.21052    -3.3232   
 -23.0215  -0.505397   12.1751   
 -35.692   -2.76103    -2.40055  
 -37.1815  -0.132498   -3.66829  
 -34.9285  -1.11368     6.14167  
 -22.0323  -5.70536     2.02968  
 -30.9994  -2.28269    -0.0893283
 -22.8432  -3.76024     7.97486  
 -32.2024  -0.239253    2.91168  
 -36.344   -0.773184   -5.31525  
 -35.886   -0.807234    0.279053 
 -33.9423  -3.78982     7.35677  
   ⋮                             
 -49.1282   0.913683   10.4061   
 -46.9862  -0.9654     -0.435579 
 -48.5334  -1.05076    -0.15223  
 -49.0331   0.379279    5.65431  
 -47.8714  -0.406195   -7.14605  
 -48.2028  -1.41369    -0.564107 
 -46.7128  -3.36643    -4.44341  
 -48.9006  -1.69293     0.0467995
 -48.5574   1.34936    -1.89814  
 -50.2291   0.0865293  -1.94494  
 -48.9263  -2.06102     2.17374  
 -48.8627   0.274894    6.49518  ,

13928x3 Array{Float64,2}:
  9.66817e-20   7.35949e-19   5.79015e-19
  0.00143962   -0.00

To use eigen-SNPs for plotting or covariates in GWAS, we typically scale them by their standard deviations so that they have mean zero and unit variance.

In [37]:
# standardize eigen-SNPs before plotting or GWAS
scale!(pcscore, 1.0 ./ √(pcvariance))
std(pcscore, 1)

1x3 Array{Float64,2}:
 1.0  1.0  1.0

Internally `pca()` converts the `SnpArray` to a matrix of minor allele counts. The default format is `Matrix{Float64}`, which can easily exceed the memory limit. Users have several options when the default `Matrix{Float64}` cannot fit into memory.  

* Use other intermediate matrix types.

In [38]:
# use single precision matrix and display the principal variances
# approximately same answer as double precision
srand(123)
pca(hapmap, 3, Matrix{Float32})[3]

3-element Array{Float32,1}:
 1841.39  
  225.324 
   70.7085

* Use a subset of SNPs.

In [39]:
# principal components using every other SNP capture about half the variance
srand(123)
pca(sub(hapmap, :, 1:2:snps), 3)[3]

3-element Array{Float64,1}:
 926.622 
 113.188 
  36.4866

* Use a sparse matrix. For large data sets made up mostly of rare variants, `pca_sp()` is more efficient: it first converts the `SnpArray` to a sparse matrix (default is `SparseMatrixCSC{Float64, Int64}`) and then computes principal components using iterative algorithms.

In [40]:
# approximately same answer if we use Float16 sparse matrix
srand(123)
pca_sp(hapmap, 3, SparseMatrixCSC{Float16, UInt32})[3]

3-element Array{Float64,1}:
 1841.4   
  225.31  
   70.7094

In [41]:
# approximately same answer if we use Int8 sparse matrix
srand(123)
pca_sp(hapmap, 3, SparseMatrixCSC{Int8, UInt32})[3]

3-element Array{Float64,1}:
 1841.4   
  225.328 
   70.7119

In [42]:
versioninfo()

Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
