# SnpArrays.jl

The module `SnpArrays` implements the `SnpArray` type for handling biallelic genotypes. `SnpArray` is an array of `Tuple{Bool,Bool}` and adopts the same coding as the [Plink binary format](http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml). If `A1` and `A2` are two alleles, the coding rule is  

| Genotype | SnpArray |  
|---|---|  
| A1,A1 | 00 |  
| A1,A2 | 01 |  
| A2,A2 | 11 |  
| missing | 10 |  

The code `10==(true,false)` is reserved for missing genotype. Otherwise the bit `1==true` represents one copy of allele `A2`. For a two-dimensional `SnpArray`, it assumes each row is a person and each column is a SNP.
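
The coding rule can be sketched in a few lines of plain Julia. The helper `a2count` is hypothetical (it is not part of `SnpArrays`), and the snippet assumes Julia 1.x, where `missing` is available.

```julia
# Hypothetical helper, not part of SnpArrays: decode one genotype tuple
# into its A2 allele count, or `missing` for the reserved code (true, false).
function a2count(g::Tuple{Bool,Bool})
    g == (true, false) && return missing   # 10 is reserved for missing
    return Int(g[1]) + Int(g[2])           # each `true` bit is one copy of A2
end

a2count((false, false))  # A1,A1
a2count((false, true))   # A1,A2
a2count((true, true))    # A2,A2
a2count((true, false))   # missing genotype
```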

## Constructor

There are various ways to initialize a `SnpArray`.  

* `SnpArray` can be initialized from [Plink binary files](http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml), say an example data set of the MAP4 gene on chromosome 3:

In [1]:
;ls -al chr3-map4-geno.*

-rw-r--r--  1 hzhou3  staff  215043 Jul 16  2014 chr3-map4-geno.bed
-rw-r--r--  1 hzhou3  staff   25088 Jul 16  2014 chr3-map4-geno.bim
-rw-r--r--  1 hzhou3  staff   39356 Jul 16  2014 chr3-map4-geno.fam
-rw-r--r--  1 hzhou3  staff    2321 Jul 16  2014 chr3-map4-geno.log


In [2]:
include("../src/SnpArrays.jl")
using SnpArrays
map4 = SnpArray("chr3-map4-geno")

959x896 SnpArrays.SnpArray{2}:
 (true,true)   (true,true)  (true,true)   …  (false,true)   (true,true)
 (false,true)  (true,true)  (true,true)      (false,true)   (true,true)
 (true,true)   (true,true)  (true,true)      (true,true)    (true,true)
 (false,true)  (true,true)  (true,true)      (false,true)   (true,true)
 (true,true)   (true,true)  (true,true)      (true,true)    (true,true)
 (true,true)   (true,true)  (true,true)   …  (true,true)    (true,true)
 (false,true)  (true,true)  (true,true)      (false,false)  (true,true)
 (true,true)   (true,true)  (true,true)      (false,true)   (true,true)
 (true,true)   (true,true)  (false,true)     (false,true)   (true,true)
 (true,true)   (true,true)  (true,true)      (true,true)    (true,true)
 (true,true)   (true,true)  (true,true)   …  (true,true)    (true,true)
 (true,true)   (true,true)  (true,true)      (true,true)    (true,true)
 (false,true)  (true,true)  (true,true)      (false,true)   (true,true)
 ⋮                               

In [3]:
# rows are persons; columns are SNPs
people, snps = size(map4)

(959,896)

Internally `SnpArray` stores data as `BitArray`s and consumes approximately the same amount of memory as the Plink `bed` file.

In [4]:
# memory usage
Base.summarysize(map4)

214896
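
The two sizes can be checked by hand. The sketch below assumes two bits per genotype for the in-memory arrays, and the Plink `bed` layout of 3 header bytes followed by one block per SNP padded to whole bytes. The variable names are illustrative only.

```julia
n, p = 959, 896                    # persons and SNPs in the MAP4 example
bits_bytes = n * p * 2 ÷ 8         # two bits per genotype, no padding
bed_bytes = 3 + p * cld(n, 4)      # 3 header bytes, each SNP padded to whole bytes
println((bits_bytes, bed_bytes))
```

The first number is slightly below the `Base.summarysize` result above; the difference is presumably object overhead. The second number equals the size of `chr3-map4-geno.bed` in the directory listing.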

* `SnpArray` can be initialized from a matrix of A1 allele counts.

In [5]:
SnpArray(rand(0:2, 5, 3))

5x3 SnpArrays.SnpArray{2}:
 (false,false)  (false,false)  (true,true)  
 (false,false)  (true,true)    (false,true) 
 (false,false)  (true,true)    (false,true) 
 (false,true)   (true,true)    (false,false)
 (false,false)  (false,true)   (false,true) 

* `SnpArray(m, n)` generates an m by n `SnpArray` with all A1 alleles.

In [6]:
s = SnpArray(5, 3)

5x3 SnpArrays.SnpArray{2}:
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)

## Summary statistics

`summarize()` computes  
* `maf`: minor allele frequencies, taking missingness into account.  
* `minor_allele`: a BitVector indicating the minor allele for each SNP. `minor_allele[j]==true` means A1 is the minor allele for SNP j. 
* `missings_by_snp`: number of missing genotypes for each SNP.  
* `missings_by_person`: number of missing genotypes for each person.

In [7]:
maf, minor_allele, missings_by_snp, missings_by_person = summarize(map4)
# minor allele frequencies
maf'

1x896 Array{Float64,2}:
 0.226799  0.00208551  0.0260688  …  0.00260688  0.382046  0.00260688

In [8]:
# total number of missing genotypes
sum(missings_by_snp), sum(missings_by_person)

(218,218)
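
To make the four outputs concrete, here is a minimal sketch on a toy matrix of genotype tuples. `toy_summarize` is a hypothetical stand-in written in plain Julia 1.x, not the package implementation; it assumes allele frequencies are estimated from non-missing genotypes only, and it calls A2 the minor allele when the two frequencies tie.

```julia
# Toy genotype matrix: rows are persons, columns are SNPs; (true, false) is missing.
G = [(false, false) (true, true);
     (false, true)  (true, false);
     (true, true)   (true, true);
     (false, false) (false, true)]

function toy_summarize(G)
    n, p = size(G)
    maf = zeros(p)
    minor_allele = falses(p)            # true means A1 is the minor allele
    missings_by_snp = zeros(Int, p)
    missings_by_person = zeros(Int, n)
    for j in 1:p
        a2 = 0                          # A2 allele count among non-missing genotypes
        for i in 1:n
            g = G[i, j]
            if g == (true, false)
                missings_by_snp[j] += 1
                missings_by_person[i] += 1
            else
                a2 += g[1] + g[2]
            end
        end
        a2freq = a2 / (2 * (n - missings_by_snp[j]))
        minor_allele[j] = a2freq > 0.5
        maf[j] = min(a2freq, 1 - a2freq)
    end
    return maf, minor_allele, missings_by_snp, missings_by_person
end

toy_summarize(G)
```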

## Random genotypes

`randgeno(a1freq)` generates a random genotype according to allele A1 frequency `a1freq`.

In [9]:
randgeno(0.5)

(true,true)

`randgeno(maf, minor_allele)` generates a random genotype according to minor allele frequency `maf` and whether the minor allele is A1 (`minor_allele==true`) or A2 (`minor_allele==false`).

In [10]:
randgeno(0.5, true)

(true,true)
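
One way to picture this is as two independent allele draws. The sketch below is a hypothetical stand-in (`toy_randgeno`), not the package code; the independence of the two alleles is an assumption, and the bits are ordered so that the reserved missing code `(true,false)` is never produced.

```julia
using Random

# Hypothetical stand-in for `randgeno(maf, minor_allele)`.
# Assumes the two alleles are drawn independently.
function toy_randgeno(rng::AbstractRNG, maf::Real, minor_is_a1::Bool)
    a2freq = minor_is_a1 ? 1 - maf : maf   # frequency of allele A2
    b1 = rand(rng) < a2freq                # first allele is A2?
    b2 = rand(rng) < a2freq                # second allele is A2?
    # A heterozygote is stored as (false, true), never as (true, false).
    return (b1 && b2, b1 || b2)
end

toy_randgeno(MersenneTwister(2016), 0.25, true)
```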

`randgeno(n, maf, minor_allele)` generates a vector of random genotypes according to a common minor allele frequency `maf` and the minor allele.

In [11]:
randgeno(10, 0.25, true)

10-element SnpArrays.SnpArray{1}:
 (false,true) 
 (false,true) 
 (false,false)
 (true,true)  
 (false,true) 
 (true,true)  
 (false,false)
 (true,true)  
 (true,true)  
 (false,true) 

`randgeno(m, n, maf, minor_allele)` generates a random $m$-by-$n$ `SnpArray` according to a vector of minor allele frequencies `maf` and a minor allele indicator vector. Lengths of both vectors should be $n$.

In [12]:
# this is a random replicate of map4 data
randgeno(size(map4), maf, minor_allele)

959x896 SnpArrays.SnpArray{2}:
 (false,true)   (true,true)  (true,true)   …  (false,true)   (true,true) 
 (true,true)    (true,true)  (true,true)      (false,true)   (true,true) 
 (false,true)   (true,true)  (true,true)      (true,true)    (true,true) 
 (true,true)    (true,true)  (true,true)      (true,true)    (true,true) 
 (false,true)   (true,true)  (true,true)      (false,true)   (false,true)
 (false,true)   (true,true)  (true,true)   …  (false,true)   (true,true) 
 (true,true)    (true,true)  (true,true)      (false,true)   (true,true) 
 (true,true)    (true,true)  (true,true)      (false,true)   (true,true) 
 (false,true)   (true,true)  (true,true)      (true,true)    (true,true) 
 (false,true)   (true,true)  (true,true)      (false,true)   (true,true) 
 (false,true)   (true,true)  (true,true)   …  (false,true)   (true,true) 
 (false,true)   (true,true)  (true,true)      (true,true)    (true,true) 
 (true,true)    (true,true)  (true,true)      (true,true)    (true,true) 
 ⋮     

## Subsetting

Subsetting a `SnpArray` works the same way as subsetting any other array.

In [13]:
# genotypes of the 1st individual
map4[1, :]

1x896 SnpArrays.SnpArray{2}:
 (true,true)  (true,true)  (true,true)  …  (false,true)  (true,true)

In [14]:
# genotypes of the 5th SNP
map4[:, 5]

959-element SnpArrays.SnpArray{1}:
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 ⋮          
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)
 (true,true)

In [15]:
# subsetting both individuals and SNPs
map4[1:5, 5:10]

5x6 SnpArrays.SnpArray{2}:
 (true,true)  (false,true)  (true,true)  …  (true,true)  (true,true)
 (true,true)  (false,true)  (true,true)     (true,true)  (true,true)
 (true,true)  (true,true)   (true,true)     (true,true)  (true,true)
 (true,true)  (false,true)  (true,true)     (true,true)  (true,true)
 (true,true)  (true,true)   (true,true)     (true,true)  (true,true)

In [16]:
# filter out rare SNPs with MAF < 0.05
map4[:, maf .>= 0.05]

959x150 SnpArrays.SnpArray{2}:
 (true,true)   (false,true)   (true,true)   …  (false,true)   (false,true) 
 (false,true)  (false,true)   (false,true)     (false,true)   (false,true) 
 (true,true)   (true,true)    (true,true)      (true,true)    (true,true)  
 (false,true)  (false,true)   (true,true)      (false,true)   (false,true) 
 (true,true)   (true,true)    (false,true)     (true,true)    (true,true)  
 (true,true)   (true,true)    (true,true)   …  (true,true)    (true,true)  
 (false,true)  (false,false)  (true,true)      (false,false)  (false,false)
 (true,true)   (true,true)    (true,true)      (true,true)    (false,true) 
 (true,true)   (false,false)  (true,true)      (false,false)  (false,true) 
 (true,true)   (false,true)   (true,true)      (false,true)   (true,true)  
 (true,true)   (true,true)    (true,true)   …  (true,true)    (true,true)  
 (true,true)   (true,true)    (true,true)      (true,true)    (true,true)  
 (false,true)  (false,true)   (true,true)      (false,tru

In [17]:
# filter out individuals with genotyping success rate < 0.999
map4[missings_by_person / snps .< 0.001, :]

817x896 SnpArrays.SnpArray{2}:
 (true,true)    (true,true)  (true,true)   …  (false,true)   (true,true)
 (false,true)   (true,true)  (true,true)      (false,true)   (true,true)
 (true,true)    (true,true)  (true,true)      (true,true)    (true,true)
 (false,true)   (true,true)  (true,true)      (false,true)   (true,true)
 (true,true)    (true,true)  (true,true)      (true,true)    (true,true)
 (true,true)    (true,true)  (true,true)   …  (true,true)    (true,true)
 (false,true)   (true,true)  (true,true)      (false,false)  (true,true)
 (true,true)    (true,true)  (true,true)      (false,true)   (true,true)
 (true,true)    (true,true)  (false,true)     (false,true)   (true,true)
 (true,true)    (true,true)  (true,true)      (true,true)    (true,true)
 (true,true)    (true,true)  (true,true)   …  (true,true)    (true,true)
 (false,true)   (true,true)  (true,true)      (false,true)   (true,true)
 (false,false)  (true,true)  (true,true)      (false,false)  (true,true)
 ⋮                  

`sub()` and `slice()` create views of subarrays without copying data, which improves efficiency in many calculations.

In [18]:
mafcommon, = summarize(sub(map4, :, maf .>= 0.05))
mafcommon'

1x150 Array{Float64,2}:
 0.226799  0.397704  0.0630865  0.206465  …  0.354015  0.390511  0.382046

## Assignment

It is possible to assign specific genotypes to a `SnpArray` entry.

In [19]:
map4[1, 1]

(true,true)

In [20]:
map4[1, 1] = (false, true)
map4[1, 1]

(false,true)

In [21]:
map4[1, 1] = NaN
map4[1, 1]

(true,false)

In [22]:
map4[1, 1] = 2
map4[1, 1]

(true,true)

Subsetted assignment such as `map4[:, 1] = NaN` is also valid.

## Copy, convert, and imputation

In most analyses we convert the whole `SnpArray` or slices of it to numeric arrays for computations. Keep in mind that the storage of the resulting data can be up to 32 fold that of the `SnpArray`. Below are estimates of memory usage for some common data types with $n$ individuals and $p$ SNPs. Here MAF denotes the **average** minor allele frequency.

* `SnpArray`: $0.25np$ bytes  
* `Matrix{Int8}`: $np$ bytes  
* `Matrix{Float32}`: $4np$ bytes  
* `Matrix{Float64}`: $8np$ bytes  
* `SparseMatrixCSC{Float64,Int64}`: $16 \cdot \text{NNZ} + 8(p+1) \approx 16np(2\text{MAF}(1-\text{MAF})+\text{MAF}^2) + 8(p+1) = 16np \cdot \text{MAF}(2-\text{MAF}) + 8(p+1)$ bytes. When average MAF=0.25, it is about $7np$ bytes. When MAF=0.025, it is about $0.8np$ bytes, 10 fold smaller than `Matrix{Float64}` type.  
* `SparseMatrixCSC{Bool,Int64}`: $2np \cdot \text{MAF} \cdot 9 + 16(p+1) = 18 np \cdot \text{MAF} + 16(p+1)$ bytes. When average MAF=0.25, it is about $4.5np$ bytes. When MAF=0.045, it is about $0.8np$ bytes, 10 fold smaller than `Matrix{Float64}` type.  

To be concrete, consider 2 typical data sets:  
* COPD (GWAS): $n = 6670$ individuals, $p = 630998$ SNPs, average MAF is 0.2454.
* GAW19 (sequencing study): $n = 1943$ individuals, $p = 1711766$ SNPs, average MAF is 0.00499.  

| Data Type | COPD | GAW19 |  
|---|---|---|  
| `SnpArray` | 1.05GB | 0.83GB |  
| `Matrix{Float64}` | 33.67GB | 26.61GB |  
| `SparseMatrixCSC{Float64,Int64}` | 29GB | 0.543GB |  
| `SparseMatrixCSC{Bool,Int64}` | 18.6GB | 0.326GB |  
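
The table entries follow from the formulas listed above. The sketch below plugs in the two data sets, taking 1 GB as $10^9$ bytes; the function names are illustrative only. The results agree with the table to the precision shown there.

```julia
# Memory estimates in GB (1 GB = 10^9 bytes), using the formulas listed above.
snparray_gb(n, p)         = 0.25 * n * p / 1e9
dense_f64_gb(n, p)        = 8 * n * p / 1e9
sparse_f64_gb(n, p, maf)  = (16 * n * p * maf * (2 - maf) + 8 * (p + 1)) / 1e9
sparse_bool_gb(n, p, maf) = (18 * n * p * maf + 16 * (p + 1)) / 1e9

datasets = (("COPD", 6670, 630998, 0.2454), ("GAW19", 1943, 1711766, 0.00499))
for (name, n, p, maf) in datasets
    estimates = (snparray_gb(n, p), dense_f64_gb(n, p),
                 sparse_f64_gb(n, p, maf), sparse_bool_gb(n, p, maf))
    println(name, ": ", round.(estimates; digits = 3))
end
```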

For data sets in which the majority of variants are rare, converting to sparse matrices saves memory and often brings computational advantages too. When choosing the integer type of the row indices `rowval` and column pointer `colptr` in the `SparseMatrixCSC` format, make sure its maximum is larger than the number of nonzeros in the matrix.
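
A quick check of the index type might look like the following sketch. `index_type` is a hypothetical helper, not part of `SnpArrays`; it assumes the type must hold both the number of rows and the last column pointer, which equals the number of nonzeros plus one.

```julia
# Hypothetical helper: narrowest index type that can hold the row count
# and the last column pointer (nnz + 1); falls back to Int64.
function index_type(nrows::Integer, nnz_estimate::Integer)
    needed = max(nrows, nnz_estimate + 1)
    for T in (UInt8, UInt16, UInt32)
        needed <= typemax(T) && return T
    end
    return Int64
end

# COPD-sized example: nonzeros estimated as n * p * MAF * (2 - MAF)
index_type(6670, round(Int, 6670 * 630998 * 0.2454 * (2 - 0.2454)))
```

Under this estimate `UInt32` would suffice for the COPD data, since the nonzero count stays below `typemax(UInt32)`.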

In [23]:
map4f64 = convert(Matrix{Float64}, map4)
# memory usage if convert to Float64
Base.summarysize(map4f64)

6874112

In [24]:
# average maf
mean(maf)

0.059234694204908275

In [25]:
map4f64sp = convert(SparseMatrixCSC{Float64, Int64}, map4)
# memory usage if convert to sparse Float64 matrix
Base.summarysize(map4f64sp)

1390144

In [26]:
map4f32sp = convert(SparseMatrixCSC{Float32, UInt32}, map4)
# memory usage if convert to sparse Float32 matrix
Base.summarysize(map4f32sp)

695092

By default the `convert()` method does **not** impute missing genotypes but converts them to `NaN`.

In [27]:
# number of missing genotypes
countnz(isnan(map4)), countnz(isnan(map4f64))

(218,218)

We can enforce imputation by setting optional argument `impute=true`. Imputation is done by generating two random alleles according to the minor allele frequency. This is not an optimal strategy and users should make sure genotypes are imputed with high quality using other advanced methods.

In [28]:
map4f64impute = convert(Matrix{Float64}, map4; impute = true)
countnz(isnan(map4f64impute))

0

By default `convert()` translates genotypes according to the *additive* SNP model, which is essentially the **minor allele** count (0, 1 or 2). Other SNP models are *dominant* and *recessive*, both in terms of the **minor allele**. When `A1` is the minor allele, a genotype is translated to a real number according to

| Genotype | `SnpArray` | `model=:additive` | `model=:dominant` | `model=:recessive` |    
|---|---|---|---|---|  
| A1,A1 | 00 | 2 | 1 | 1 |  
| A1,A2 | 01 | 1 | 1 | 0 |  
| A2,A2 | 11 | 0 | 0 | 0 |  
| missing | 10 | NaN | NaN | NaN | 

When `A2` is the minor allele, genotype is translated according to

| Genotype | `SnpArray` | `model=:additive` | `model=:dominant` | `model=:recessive` |    
|---|---|---|---|---|  
| A1,A1 | 00 | 0 | 0 | 0 |  
| A1,A2 | 01 | 1 | 1 | 0 |  
| A2,A2 | 11 | 2 | 1 | 1 |  
| missing | 10 | NaN | NaN | NaN |
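
The two tables can be condensed into one small function. `toy_convert` below is a hypothetical illustration in plain Julia, not the `SnpArrays` implementation; the argument `minor_is_a1` plays the role of `minor_allele[j]`.

```julia
# Hypothetical helper mirroring the two tables above.
function toy_convert(g::Tuple{Bool,Bool}, minor_is_a1::Bool, model::Symbol = :additive)
    g == (true, false) && return NaN           # missing genotype
    a2 = g[1] + g[2]                           # copies of allele A2
    minor = minor_is_a1 ? 2 - a2 : a2          # copies of the minor allele
    if model == :additive
        return float(minor)
    elseif model == :dominant
        return minor >= 1 ? 1.0 : 0.0
    elseif model == :recessive
        return minor == 2 ? 1.0 : 0.0
    else
        throw(ArgumentError("unknown model $model"))
    end
end

# A1,A2 heterozygote when A1 is the minor allele
[toy_convert((false, true), true, m) for m in (:additive, :dominant, :recessive)]
```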

In [29]:
[convert(Vector{Float64}, map4[1:10, 1]; model = :additive) convert(Vector{Float64}, map4[1:10, 1]; model = :dominant) convert(Vector{Float64}, map4[1:10, 1]; model = :recessive)]

10x3 Array{Float64,2}:
 0.0  0.0  0.0
 1.0  1.0  0.0
 0.0  0.0  0.0
 1.0  1.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 1.0  1.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

By default `convert()` does **not** center and scale genotypes. Setting optional arguments `center=true, scale=true` centers genotypes at 2MAF and scales them by $[2 \cdot MAF \cdot (1 - MAF)]^{-1/2}$. Mono-allelic SNPs (MAF=0) are not scaled.

In [30]:
[convert(Vector{Float64}, map4[:, 1]) convert(Vector{Float64}, map4[:, 1]; center = true, scale = true)]

959x2 Array{Float64,2}:
 0.0  -0.76593 
 1.0   0.922637
 0.0  -0.76593 
 1.0   0.922637
 0.0  -0.76593 
 0.0  -0.76593 
 1.0   0.922637
 0.0  -0.76593 
 0.0  -0.76593 
 0.0  -0.76593 
 0.0  -0.76593 
 0.0  -0.76593 
 1.0   0.922637
 ⋮             
 0.0  -0.76593 
 0.0  -0.76593 
 0.0  -0.76593 
 0.0  -0.76593 
 1.0   0.922637
 1.0   0.922637
 1.0   0.922637
 1.0   0.922637
 1.0   0.922637
 1.0   0.922637
 0.0  -0.76593 
 0.0  -0.76593 
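
The two distinct values in the second column can be reproduced by hand from the MAF of the first SNP (0.226799 in the `summarize()` output). The names `maf1` and `standardize` are illustrative only.

```julia
maf1 = 0.226799                          # MAF of the first SNP
center = 2 * maf1
scale = 1 / sqrt(2 * maf1 * (1 - maf1))
standardize(x) = (x - center) * scale

# minor allele counts 0 and 1 after centering and scaling
standardize(0.0), standardize(1.0)
```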

`copy!()` is the in-place version of `convert()`. Applications such as GWAS loop over SNPs and perform a statistical analysis for each SNP. This can be achieved by

In [31]:
g = zeros(size(map4, 1))
for j = 1:size(map4, 2)
    copy!(g, map4[:, j]; model = :additive, impute = true)
    # do statistical analysis
end

## Empirical kinship

`grm()` computes the empirical kinship matrix using either the genetic relationship matrix (`method=:GRM`, default) or the method of moments (`method=:MoM`). Missing genotypes are imputed according to minor allele frequencies on the fly.

In [32]:
# GRM using all SNPs
grm(map4)

959x959 Array{Float64,2}:
  0.380544    -0.001171     -0.0288245  …  -0.028186    0.0197438  
 -0.001171     0.0910801    -0.0226232     -0.0126247  -0.000119002
 -0.0288245   -0.0226232     0.641305       0.0814241  -0.0255233  
 -0.0050335    0.0150372    -0.0215443     -0.025145   -0.0039815  
 -0.024474    -0.022623      0.073076       0.142048   -0.0285776  
 -0.024404    -0.0256074     0.076275   …   0.0715203  -0.0285076  
  0.0453891    0.0370275    -0.130117      -0.128325    0.0418238  
 -0.0218192   -0.0230977     0.0714131      0.0720706  -0.0243318  
  0.032136    -0.00637036   -0.124575      -0.122782    0.016663   
 -0.00733452  -0.0426735    -0.0116094     -0.0109709  -0.0417872  
 -0.0311827   -0.0220789     0.089408   …   0.080414   -0.0266451  
 -0.0300756   -0.0226379     0.0839075      0.081521   -0.025538   
 -0.00867467   0.0113961    -0.0251854     -0.0287862  -0.00762267 
  ⋮                                     ⋱                          
 -0.0215018   -0.01965

In [33]:
# GRM using every other SNP
grm(sub(map4, :, 1:2:snps))

959x959 Array{Float64,2}:
  0.07881      -0.000853301  -0.0235223  …  -0.0231723    0.0184407  
 -0.000853301   0.0354295    -0.0250093     -0.00734379  -0.000482591
 -0.0235223    -0.0250093     0.656679       0.0801822   -0.0209578  
 -0.00124231    0.0177249    -0.0253983     -0.0250484   -0.000871595
 -0.0204739    -0.0210335     0.0720974      0.123584    -0.0264231  
 -0.0174959    -0.0274966     0.0705212  …   0.0685633   -0.0234451  
  0.0395097     0.0410407    -0.129707      -0.127049     0.0376866  
 -0.0157941    -0.0219088     0.0735532      0.0715953   -0.0178573  
  0.0320641    -0.0128681    -0.129525      -0.126867     0.00642591 
 -0.0098777    -0.0480998    -0.0183572     -0.0180072   -0.0440483  
 -0.0189452    -0.0226958     0.0799405  …   0.0779826   -0.0186442  
 -0.0201318    -0.0238824     0.0787539      0.076796    -0.0198309  
 -0.0035757     0.0153915    -0.0277317     -0.0273818   -0.00320499 
  ⋮                                      ⋱                      

In [34]:
# MoM using all SNPs
grm(map4; method = :MoM)

959x959 Array{Float64,2}:
  0.140001     -0.00682793   0.0211395    …   0.0281314     0.0281314 
 -0.00682793    0.119026     0.0630907        0.0910581    -0.0138198 
  0.0211395     0.0630907    0.92309          0.874147      0.0281314 
 -0.0417872     0.0560988    0.049107         0.0421151    -0.0487791 
  0.0211395     0.0281314    0.804228         0.853171     -0.00682793
  0.0211395     0.0281314    0.81122      …   0.797236     -0.00682793
 -0.00682793   -0.00682793  -0.775933        -0.761949     -0.0417872 
  0.0630907     0.0630907    0.81122          0.818212      0.0351232 
 -0.237559     -0.321462    -0.831867        -0.817884     -0.27951   
 -0.293494     -0.342437     0.0281314        0.0351232    -0.363413  
  0.000163934   0.0560988    0.88813      …   0.860163      0.0141477 
  0.000163934   0.0560988    0.860163         0.867155      0.0141477 
 -0.0837384     0.0141477    0.00715579       0.000163934  -0.0907303 
  ⋮                                       ⋱        
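
For orientation, one common definition of the genetic relationship matrix on the kinship scale is $ZZ^T/(2p)$, where $Z$ holds the centered and scaled minor allele counts. The sketch below implements that definition in plain Julia 1.x. It is an assumption that `grm()` uses this exact normalization; `toy_grm` is a hypothetical name, and the input is assumed to contain no missing values.

```julia
using LinearAlgebra

# X: n-by-p matrix of minor allele counts without missing values (hypothetical input)
function toy_grm(X::AbstractMatrix{<:Real})
    n, p = size(X)
    maf = vec(sum(X, dims = 1)) ./ (2n)
    keep = findall(m -> 0 < m < 1, maf)       # skip mono-allelic SNPs
    f = maf[keep]'                            # row vector of retained frequencies
    Z = (X[:, keep] .- 2 .* f) ./ sqrt.(2 .* f .* (1 .- f))
    return Z * Z' / (2 * length(keep))        # kinship scale: divide by 2 * #SNPs
end

K = toy_grm([0 1; 1 2; 2 0; 1 1])
```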

## Principal component analysis 

Principal component analysis is widely used in genomics for adjusting for population substructure. `pca(A, pcs)` computes the top `pcs` principal components based on SNP data `A`. Missing genotypes are imputed according to minor allele frequencies on the fly. This means that, in the presence of missing genotypes, running the function on the same `SnpArray` twice may produce slightly different answers. Each SNP is centered at $2\text{MAF}$ and scaled by $[2\text{MAF}(1-\text{MAF})]^{-1/2}$. Internally `pca()` converts `SnpArray` to the matrix of minor allele counts. The default is `Matrix{Float64}`, which can easily exceed memory size. Users can optionally choose a single precision matrix format in the third argument `pca(A, pcs, Matrix{Float32})`. The output is  

* `pcscore`: the `pcs` eigen-SNPs, or principal scores, in each column  
* `pcloading`: the `pcs` eigen-vectors, or principal loadings, in each column  
* `pcvariance`: the `pcs` eigen-values, or principal variances
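
The relationship between the three outputs can be illustrated with a dense SVD of the standardized genotype matrix. This is a sketch in plain Julia 1.x under stated assumptions, not the `pca()` implementation: `toy_pca` is a hypothetical name, the input holds minor allele counts without missing values, and the variances are normalized by $n-1$.

```julia
using LinearAlgebra

function toy_pca(X::AbstractMatrix{<:Real}, pcs::Integer)
    n, p = size(X)
    maf = vec(sum(X, dims = 1)) ./ (2n)
    sd = sqrt.(2 .* maf .* (1 .- maf))
    sd[sd .== 0] .= 1                          # leave mono-allelic SNPs unscaled
    Z = (X .- 2 .* maf') ./ sd'                # center at 2MAF, then scale
    F = svd(Z)
    pcscore = F.U[:, 1:pcs] * Diagonal(F.S[1:pcs])   # principal scores
    pcloading = F.V[:, 1:pcs]                        # principal loadings
    pcvariance = F.S[1:pcs] .^ 2 ./ (n - 1)          # principal variances
    return pcscore, pcloading, pcvariance
end

X = [0 1 2; 1 1 0; 2 0 1; 0 2 1; 1 0 0]
score, loading, variance = toy_pca(X, 2)
```

With this construction each column of the scores equals the standardized matrix times the corresponding loading, and the loadings are orthonormal.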

In [35]:
pcscore, pcloading, pcvariance = pca(map4, 3)

(
959x3 Array{Float64,2}:
   3.10871    0.609436    1.40923  
   3.18514    0.967911    2.01049  
 -12.2077     1.13888     0.0260712
   3.76373    1.0381      2.33647  
 -12.3111     0.571372   -0.251602 
 -12.1212     1.10989    -0.184244 
  18.8282     0.682441    3.80209  
 -11.3023     1.18534     0.122347 
  18.0564    -4.58311   -30.8036   
  -6.42758  -75.3048     10.4511   
 -12.5002     1.21314    -0.134413 
 -12.364      0.793268   -0.020948 
   4.41182    1.15213     2.7549   
   ⋮                               
 -11.9735     1.15731    -0.381319 
 -12.0806     1.17853    -0.388299 
 -12.0806     1.17853    -0.388299 
 -12.0806     1.17853    -0.388299 
   3.56885    0.883671    2.3137   
   4.13172    0.997728    2.26763  
   4.13172    0.997728    2.26763  
   3.82445    1.0956      2.39448  
   4.07383    1.09186     2.44804  
   4.31087    1.0474      2.37593  
 -12.1906     0.95165    -0.106233 
   3.38863    0.668819    1.38409  ,

896x3 Array{Float64,2}:
  0.0683538 

In [36]:
# principal components using every other SNP
pcscore, pcloading, pcvariance = pca(map4[:, 1:2:snps], 3)

(
959x3 Array{Float64,2}:
  1.72009    0.650044    1.12881 
  2.30166    0.910005    1.54832 
 -8.68676    0.775582   -0.367164
  2.82714    0.950579    1.74903 
 -8.85737    0.356183   -0.502523
 -8.54455    0.880991   -0.476607
 13.0628     0.936325    3.33578 
 -7.81507    0.78786    -0.265526
 13.8132    -5.83978   -22.1125  
 -4.8833   -57.5106     14.8825  
 -8.36141    0.781483   -0.543623
 -8.24859    0.389077   -0.353896
  3.08666    1.01915     1.92353 
  ⋮                              
 -8.3019     0.787893   -0.453856
 -8.4542     0.815059   -0.470537
 -8.4542     0.815059   -0.470537
 -8.4542     0.815059   -0.470537
  2.58737    0.98336     1.79521 
  2.87896    0.981337    1.76161 
  2.87896    0.981337    1.76161 
  2.94132    0.963903    1.74712 
  2.94132    0.963903    1.74712 
  2.99314    0.994661    1.7597  
 -8.61823    0.718674   -0.489189
  2.05817    0.649151    1.10201 ,

448x3 Array{Float64,2}:
  0.0973563    0.0203351     0.0530207  
  0.0311161   -0.042689

For large data sets with a majority of rare variants, `pca_sp()` is more efficient: it first converts the `SnpArray` to a sparse matrix (default `SparseMatrixCSC{Float64, Int64}`) and then computes principal components using iterative algorithms. 

In [37]:
pcscore, pcloading, pcvariance = pca_sp(map4, 3)

(
959x3 Array{Float64,2}:
   3.10831    0.608452    1.4096   
   3.18625    0.968882    2.01008  
 -12.2094     1.13682     0.0253001
   3.76257    1.03993     2.33681  
 -12.3112     0.569563   -0.251509 
 -12.1206     1.10748    -0.186229 
  18.8266     0.685152    3.80294  
 -11.3028     1.18306     0.117167 
  18.0608    -4.58264   -30.8032   
  -6.41371  -75.3046     10.454    
 -12.5017     1.2108     -0.135174 
 -12.2753     0.773877   -0.0255436
   4.41113    1.15452     2.75612  
   ⋮                               
 -11.9746     1.15521    -0.382019 
 -12.0816     1.1764     -0.389029 
 -12.0816     1.1764     -0.389029 
 -12.0816     1.1764     -0.389029 
   3.57006    0.885198    2.31334  
   4.13114    0.999646    2.26812  
   4.13114    0.999646    2.26812  
   3.82322    1.09754     2.39476  
   4.0728     1.09387     2.4484   
   4.31023    1.04946     2.37648  
 -12.1899     0.949322   -0.107583 
   3.388      0.667561    1.38386  ,

896x3 Array{Float64,2}:
  0.0683355 

In [38]:
versioninfo()

Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
