# SnpArrays.jl

The module `SnpArrays` implements the `SnpArray` type for handling biallelic genotypes. `SnpArray` is an array of `Tuple{Bool,Bool}` and adopts the same coding as the [Plink binary format](http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml). If `A1` and `A2` are two alleles, the coding rule is  

| Genotype | SnpArray |  
|---|---|  
| A1,A1 | 00 |  
| A1,A2 | 01 |  
| A2,A2 | 11 |  
| missing | 10 |  

The code `10==(true,false)` is reserved for missing genotypes. Otherwise the bit `1==true` represents one copy of allele `A2`. A two-dimensional `SnpArray` assumes each row is a person and each column is a SNP.
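
The coding rule can be sketched in plain Julia without the package. The helper name `a2count` below is hypothetical and not part of `SnpArrays`; it maps a `Tuple{Bool,Bool}` to the number of `A2` copies and returns `NaN` for the reserved missing code.

```julia
# Hypothetical helper (not part of SnpArrays): decode one genotype tuple
# into the number of A2 copies, following the coding table above.
function a2count(g::Tuple{Bool,Bool})
    g == (true, false) && return NaN   # 10 is reserved for missing
    return Float64(g[1] + g[2])        # each `true` bit is one copy of A2
end

genotypes = [(false, false), (false, true), (true, true), (true, false)]
counts = map(a2count, genotypes)       # 0.0, 1.0, 2.0, NaN
```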

## Constructor

There are various ways to initialize a `SnpArray`.  

* `SnpArray` can be initialized from [Plink binary files](http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml), for example the data set `hapmap3`:

In [1]:
;ls -al hapmap3.*

-rw-r--r--@ 1 hzhou3  staff  1128171 Jun  4  2010 hapmap3.bed
-rw-r--r--@ 1 hzhou3  staff   388672 Jun  4  2010 hapmap3.bim
-rw-r--r--@ 1 hzhou3  staff     7136 Jun  4  2010 hapmap3.fam
-rw-r--r--@ 1 hzhou3  staff   332960 Jun  4  2010 hapmap3.map


In [2]:
include("../src/SnpArrays.jl")
using SnpArrays
hapmap = SnpArray("hapmap3")

324x13928 SnpArrays.SnpArray{2}:
 (true,true)  (true,true)   (false,false)  …  (true,true)   (true,true)
 (true,true)  (false,true)  (false,true)      (false,true)  (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (false,true)  (true,true)    …  (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)   …  (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 ⋮                             

In [3]:
# rows are persons; columns are SNPs
people, snps = size(hapmap)

(324,13928)

Internally `SnpArray` stores data as `BitArray`s and consumes approximately the same amount of memory as the Plink `bed` file.

In [4]:
# memory usage
Base.summarysize(hapmap)

1128256
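
The near match with the `bed` file size follows from the 2-bit packing. As a rough check, assuming the standard SNP-major Plink layout (3 header bytes, then 4 genotypes per byte with each SNP padded to a whole number of bytes), the expected file size can be computed directly. `bedsize` is a hypothetical helper, not a package function.

```julia
# Hypothetical helper: expected size in bytes of a SNP-major Plink bed file.
# Assumes 3 header bytes and 4 genotypes packed per byte, with each SNP
# (column) padded to a whole byte.
bedsize(people, snps) = 3 + cld(people, 4) * snps

bedsize(324, 13928)   # 1128171, the size of hapmap3.bed in the listing above
```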

* `SnpArray` can be initialized from a matrix of `A2` allele counts (0, 1, or 2).

In [5]:
SnpArray(rand(0:2, 5, 3))

5x3 SnpArrays.SnpArray{2}:
 (false,false)  (false,true)   (true,true)  
 (false,false)  (true,true)    (false,false)
 (false,true)   (false,false)  (true,true)  
 (false,false)  (true,true)    (true,true)  
 (false,true)   (false,true)   (true,true)  

* `SnpArray(m, n)` generates an m by n `SnpArray` with all A1 alleles.

In [6]:
s = SnpArray(5, 3)

5x3 SnpArrays.SnpArray{2}:
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)
 (false,false)  (false,false)  (false,false)

## Summary statistics

`summarize()` computes  
* `maf`: minor allele frequencies, taking missingness into account.  
* `minor_allele`: a `BitVector` indicating the minor allele for each SNP. `minor_allele[j]==true` means A1 is the minor allele for SNP j. `minor_allele[j]==false` means A2 is the minor allele for SNP j.  
* `missings_by_snp`: number of missing genotypes for each SNP.  
* `missings_by_person`: number of missing genotypes for each person.

In [7]:
maf, minor_allele, missings_by_snp, missings_by_person = summarize(hapmap)
# minor allele frequencies
maf'

1x13928 Array{Float64,2}:
 0.0  0.0776398  0.324074  0.191589  …  0.00154321  0.0417957  0.00617284

In [8]:
# total number of missing genotypes
sum(missings_by_snp), sum(missings_by_person)

(11894,11894)
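
To make the four outputs concrete, here is a minimal sketch that computes them on a plain `Matrix{Tuple{Bool,Bool}}`. It is a hypothetical re-implementation for illustration, not the package source. It assumes that frequencies are computed over non-missing genotypes only and that a tie at frequency 0.5 is reported with A2 as the minor allele.

```julia
# Hypothetical re-implementation on a plain matrix of genotype tuples;
# the package's own summarize() may differ in details such as tie-breaking.
function summarize_sketch(G::Matrix{Tuple{Bool,Bool}})
    n, p = size(G)
    maf = zeros(p)
    minor_allele = falses(p)            # true means A1 is the minor allele
    missings_by_snp = zeros(Int, p)
    missings_by_person = zeros(Int, n)
    for j in 1:p
        a2 = 0                          # A2 copies among observed genotypes
        for i in 1:n
            g = G[i, j]
            if g == (true, false)       # reserved missing code
                missings_by_snp[j] += 1
                missings_by_person[i] += 1
            else
                a2 += g[1] + g[2]
            end
        end
        observed = n - missings_by_snp[j]
        a2freq = observed > 0 ? a2 / (2 * observed) : 0.0
        minor_allele[j] = a2freq > 0.5  # A2 is the common allele, so A1 is minor
        maf[j] = minor_allele[j] ? 1 - a2freq : a2freq
    end
    return maf, minor_allele, missings_by_snp, missings_by_person
end

# 3 persons by 2 SNPs, filled column by column
G = reshape([(false, false), (false, true), (false, false),
             (true, true), (true, false), (true, true)], 3, 2)
maf_s, minor_s, miss_snp, miss_person = summarize_sketch(G)
```

As in the cell above, the two missingness vectors must add up to the same total.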

## Random genotypes

`randgeno(a1freq)` generates a random genotype according to allele A1 frequency `a1freq`.

In [9]:
randgeno(0.5)

(false,false)
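
One way such a draw could work is sketched below. It assumes the two alleles are drawn independently (Hardy-Weinberg proportions) and is not taken from the package source; `randgeno_sketch` is a hypothetical name.

```julia
# Hypothetical sketch, not the package source: draw two alleles independently
# and store them in the canonical bit order.
function randgeno_sketch(a1freq::Float64)
    b1 = rand() > a1freq    # true means this allele is A2
    b2 = rand() > a1freq
    # order the bits so a heterozygote is (false,true),
    # never the missing code (true,false)
    return (b1 & b2, b1 | b2)
end

draws = [randgeno_sketch(0.5) for i in 1:1000]
```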

`randgeno(maf, minor_allele)` generates a random genotype according to minor allele frequency `maf` and whether the minor allele is A1 (`minor_allele==true`) or A2 (`minor_allele==false`).

In [10]:
randgeno(0.5, true)

(true,true)

`randgeno(n, maf, minor_allele)` generates a vector of random genotypes according to a common minor allele frequency `maf` and the minor allele.

In [11]:
randgeno(10, 0.25, true)

10-element SnpArrays.SnpArray{1}:
 (true,true) 
 (true,true) 
 (false,true)
 (true,true) 
 (false,true)
 (true,true) 
 (true,true) 
 (false,true)
 (true,true) 
 (false,true)

`randgeno(m, n, maf, minor_allele)` generates a random $m$-by-$n$ `SnpArray` according to a vector of minor allele frequencies `maf` and a minor allele indicator vector. Lengths of both vectors should be $n$.

In [12]:
# this is a random replicate of the hapmap data
randgeno(size(hapmap), maf, minor_allele)

324x13928 SnpArrays.SnpArray{2}:
 (true,true)  (true,true)   (true,true)    …  (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (false,true)  (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)   …  (false,true)  (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 (true,true)  (false,true)  (true,true)       (true,true)   (true,true)
 (true,true)  (false,true)  (false,false)     (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)   …  (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 ⋮                             

## Subsetting

Subsetting a `SnpArray` works the same way as subsetting any other array.

In [13]:
# genotypes of the 1st individual
hapmap[1, :]

1x13928 SnpArrays.SnpArray{2}:
 (true,true)  (true,true)  (false,false)  …  (true,true)  (true,true)

In [14]:
# genotypes of the 5th SNP
hapmap[:, 5]

324-element SnpArrays.SnpArray{1}:
 (true,true)  
 (true,true)  
 (false,true) 
 (false,true) 
 (true,true)  
 (false,false)
 (false,false)
 (true,true)  
 (true,true)  
 (true,true)  
 (true,true)  
 (true,true)  
 (false,true) 
 ⋮            
 (false,false)
 (true,true)  
 (false,true) 
 (true,true)  
 (true,true)  
 (true,true)  
 (true,true)  
 (true,true)  
 (false,true) 
 (true,true)  
 (true,true)  
 (true,true)  

In [15]:
# subsetting both individuals and SNPs
hapmap[1:5, 5:10]

5x6 SnpArrays.SnpArray{2}:
 (true,true)   (true,true)  (false,true)  …  (true,true)   (false,true)
 (true,true)   (true,true)  (true,true)      (true,true)   (false,true)
 (false,true)  (true,true)  (true,true)      (false,true)  (true,true) 
 (false,true)  (true,true)  (true,true)      (true,true)   (false,true)
 (true,true)   (true,true)  (true,true)      (true,true)   (false,true)

In [16]:
# filter out rare SNPs with MAF < 0.05
hapmap[:, maf .>= 0.05]

324x12085 SnpArrays.SnpArray{2}:
 (true,true)   (false,false)  (true,true)   …  (false,true)  (false,true)
 (false,true)  (false,true)   (false,true)     (true,true)   (true,true) 
 (true,true)   (false,true)   (false,true)     (true,true)   (true,true) 
 (true,true)   (false,true)   (true,true)      (false,true)  (false,true)
 (true,true)   (false,true)   (false,true)     (true,true)   (true,true) 
 (false,true)  (true,true)    (true,true)   …  (false,true)  (false,true)
 (true,true)   (true,true)    (true,true)      (true,true)   (true,true) 
 (true,true)   (false,false)  (true,true)      (true,true)   (true,true) 
 (true,true)   (false,true)   (false,true)     (true,true)   (true,true) 
 (true,true)   (false,true)   (true,true)      (false,true)  (false,true)
 (true,true)   (false,true)   (true,true)   …  (true,true)   (true,true) 
 (true,true)   (true,true)    (false,true)     (false,true)  (false,true)
 (true,true)   (false,false)  (true,true)      (false,true)  (false,true)
 ⋮   

In [17]:
# filter out individuals with genotyping success rate < 0.90
hapmap[missings_by_person / people .< 0.1, :]

220x13928 SnpArrays.SnpArray{2}:
 (true,true)  (true,true)   (false,false)  …  (true,true)   (true,true)
 (true,true)  (false,true)  (false,true)      (false,true)  (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)   …  (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (true,true)   (false,false)     (true,true)   (true,true)
 (true,true)  (true,true)   (false,true)      (true,true)   (true,true)
 (true,true)  (false,true)  (false,true)   …  (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 (true,true)  (true,true)   (true,true)       (true,true)   (true,true)
 ⋮                             

`sub()` and `slice()` create views of a subarray without copying data, which improves efficiency in many calculations.

In [18]:
mafcommon, = summarize(sub(hapmap, :, maf .>= 0.05))
mafcommon'

1x12085 Array{Float64,2}:
 0.0776398  0.324074  0.191589  …  0.310937  0.23913  0.23913  0.23913

## Assignment

It is possible to assign specific genotypes to a `SnpArray` entry.

In [19]:
hapmap[1, 1]

(true,true)

In [20]:
hapmap[1, 1] = (false, true)
hapmap[1, 1]

(false,true)

In [21]:
hapmap[1, 1] = NaN
hapmap[1, 1]

(true,false)

In [22]:
hapmap[1, 1] = 2
hapmap[1, 1]

(true,true)

Subsetted assignment such as `hapmap[:, 1] = NaN` is also valid.
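
The numeric assignments above suggest a simple mapping from real values to genotype tuples. The sketch below uses a hypothetical helper `togenotype`. The `NaN` and `2` cases are taken from the cells above, the `0` and `1` cases are assumed from the coding table, and rejecting other values is an assumption rather than documented package behavior.

```julia
# Hypothetical helper mirroring the assignments above: map a real value to a
# genotype tuple. NaN -> missing and 2 -> (true,true) come from the cells
# above; the 0 and 1 cases are assumed from the coding table.
function togenotype(x::Real)
    isnan(x) && return (true, false)
    x == 0 && return (false, false)
    x == 1 && return (false, true)
    x == 2 && return (true, true)
    throw(ArgumentError("genotype value must be 0, 1, 2 or NaN"))
end

map(togenotype, [0, 1, 2, NaN])
```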

## Copy, convert, and imputation

In most analyses we convert the whole `SnpArray` or slices of it to numeric arrays for computations. Keep in mind that the storage of the resulting data can be up to 32-fold that of the `SnpArray`. Below are estimates of memory usage for some common data types with $n$ individuals and $p$ SNPs. Here MAF denotes the **average** minor allele frequency.

* `SnpArray`: $0.25np$ bytes  
* `Matrix{Int8}`: $np$ bytes  
* `Matrix{Float32}`: $4np$ bytes  
* `Matrix{Float64}`: $8np$ bytes  
* `SparseMatrixCSC{Float64,Int64}`: $16 \cdot \text{NNZ} + 8(p+1) \approx 16np(2\text{MAF}(1-\text{MAF})+\text{MAF}^2) + 8(p+1) = 16np \cdot \text{MAF}(2-\text{MAF}) + 8(p+1)$ bytes. When average MAF=0.25, it is about $7np$ bytes. When MAF=0.025, it is about $0.8np$ bytes, 10-fold smaller than the `Matrix{Float64}` type.  
* `SparseMatrixCSC{Bool,Int64}`: $2np \cdot \text{MAF} \cdot 9 + 16(p+1) = 18 np \cdot \text{MAF} + 16(p+1)$ bytes. When average MAF=0.25, it is about $4.5np$ bytes. When MAF=0.045, it is about $0.8np$ bytes, 10-fold smaller than the `Matrix{Float64}` type.  

To be concrete, consider 2 typical data sets:  
* COPD (GWAS): $n = 6670$ individuals, $p = 630998$ SNPs, average MAF is 0.2454.
* GAW19 (sequencing study): $n = 1943$ individuals, $p = 1711766$ SNPs, average MAF is 0.00499.  

| Data Type | COPD | GAW19 |  
|---|---|---|  
| `SnpArray` | 1.05GB | 0.83GB |  
| `Matrix{Float64}` | 33.67GB | 26.61GB |  
| `SparseMatrixCSC{Float64,Int64}` | 29GB | 0.543GB |  
| `SparseMatrixCSC{Bool,Int64}` | 18.6GB | 0.326GB |  
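
The table entries can be reproduced from the formulas above with a few lines of plain Julia. The function names are hypothetical, and 1 GB is taken as $10^9$ bytes, which is the convention the table appears to follow.

```julia
# Byte estimates from the formulas above (n individuals, p SNPs, average MAF).
# The function names are hypothetical; 1 GB is taken as 1e9 bytes.
snparray_bytes(n, p) = 0.25 * n * p
float64_bytes(n, p) = 8.0 * n * p
sparse_float64_bytes(n, p, maf) = 16.0 * n * p * maf * (2 - maf) + 8 * (p + 1)
sparse_bool_bytes(n, p, maf) = 18.0 * n * p * maf + 16 * (p + 1)

datasets = [("COPD", 6670, 630998, 0.2454), ("GAW19", 1943, 1711766, 0.00499)]
for (name, n, p, maf) in datasets
    println(name, ": SnpArray ", snparray_bytes(n, p) / 1e9, " GB, ",
            "Float64 ", float64_bytes(n, p) / 1e9, " GB, ",
            "sparse Float64 ", sparse_float64_bytes(n, p, maf) / 1e9, " GB, ",
            "sparse Bool ", sparse_bool_bytes(n, p, maf) / 1e9, " GB")
end
```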

Clearly, for data sets dominated by rare variants, converting to sparse matrices saves memory and often brings computational advantages too. When choosing the integer type of the row indices `rowval` and column pointer `colptr` in the `SparseMatrixCSC` format, make sure its maximum is larger than the number of nonzeros in the matrix.

In [23]:
hapmapf64 = convert(Matrix{Float64}, hapmap)
# memory usage if convert to Float64
Base.summarysize(hapmapf64)

36101376

In [24]:
# average maf
mean(maf)

0.222585591341583

In [25]:
hapmapf64sp = convert(SparseMatrixCSC{Float64, Int64}, hapmap)
# memory usage if convert to sparse Float64 matrix
Base.summarysize(hapmapf64sp)

25949488

In [26]:
hapmapf32sp = convert(SparseMatrixCSC{Float32, UInt32}, hapmap)
# memory usage if convert to sparse Float32 matrix
Base.summarysize(hapmapf32sp)

12974764

By default, `convert()` does **not** impute missing genotypes; it converts them to `NaN`.

In [27]:
# number of missing genotypes
countnz(isnan(hapmap)), countnz(isnan(hapmapf64))

(11894,11894)

We can enforce imputation by setting the optional argument `impute=true`. Imputation is done by generating two random alleles according to the minor allele frequency. This is not an optimal strategy, and users should make sure genotypes are imputed with high quality using other, more advanced methods.

In [28]:
hapmapf64impute = convert(Matrix{Float64}, hapmap; impute = true)
countnz(isnan(hapmapf64impute))

0
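
The imputation strategy described above can be sketched on a plain vector of minor allele counts. This is an illustration under the stated assumption of two independent allele draws, not the package source; `impute_sketch!` is a hypothetical name.

```julia
# Hypothetical sketch of the strategy described above, not the package source:
# replace each NaN in a vector of minor allele counts by the sum of two
# independent Bernoulli(maf) draws.
function impute_sketch!(x::Vector{Float64}, maf::Float64)
    for i in eachindex(x)
        if isnan(x[i])
            x[i] = (rand() < maf) + (rand() < maf)
        end
    end
    return x
end

x = [0.0, NaN, 2.0, NaN, 1.0]
impute_sketch!(x, 0.25)   # observed entries are untouched, NaNs become 0, 1 or 2
```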

By default `convert()` translates genotypes according to the *additive* SNP model, which is essentially the **minor allele** count (0, 1 or 2). Other SNP models are *dominant* and *recessive*, both in terms of the **minor allele**. When `A1` is the minor allele, genotypes are translated to real numbers according to

| Genotype | `SnpArray` | `model=:additive` | `model=:dominant` | `model=:recessive` |    
|---|---|---|---|---|  
| A1,A1 | 00 | 2 | 1 | 1 |  
| A1,A2 | 01 | 1 | 1 | 0 |  
| A2,A2 | 11 | 0 | 0 | 0 |  
| missing | 10 | NaN | NaN | NaN | 

When `A2` is the minor allele, genotype is translated according to

| Genotype | `SnpArray` | `model=:additive` | `model=:dominant` | `model=:recessive` |    
|---|---|---|---|---|  
| A1,A1 | 00 | 0 | 0 | 0 |  
| A1,A2 | 01 | 1 | 1 | 0 |  
| A2,A2 | 11 | 2 | 1 | 1 |  
| missing | 10 | NaN | NaN | NaN |

In [29]:
[convert(Vector{Float64}, hapmap[1:10, 5]; model = :additive) convert(Vector{Float64}, hapmap[1:10, 5]; model = :dominant) convert(Vector{Float64}, hapmap[1:10, 5]; model = :recessive)]

10x3 Array{Float64,2}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 0.0  0.0  0.0
 2.0  1.0  1.0
 2.0  1.0  1.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
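
The two translation tables can be written as one small function. `translate` is a hypothetical helper working on plain tuples, not a package function; it counts copies of the minor allele and then applies the chosen model.

```julia
# Hypothetical helper: translate one genotype tuple to a number under the
# additive, dominant, or recessive model, following the two tables above.
function translate(g::Tuple{Bool,Bool}, minor_is_a1::Bool, model::Symbol)
    g == (true, false) && return NaN          # missing
    a2 = g[1] + g[2]                          # copies of A2
    minor = minor_is_a1 ? 2 - a2 : a2         # copies of the minor allele
    model == :additive  && return Float64(minor)
    model == :dominant  && return Float64(minor >= 1)
    model == :recessive && return Float64(minor == 2)
    throw(ArgumentError("unknown model $model"))
end

# first seven genotypes of SNP 5 as printed in the subsetting section
snp5 = [(true, true), (true, true), (false, true), (false, true),
        (true, true), (false, false), (false, false)]
additive = [translate(g, true, :additive) for g in snp5]
```

With `minor_is_a1 = true`, the additive values agree with the first column of the output above.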

By default `convert()` does **not** center and scale genotypes. Setting the optional arguments `center=true, scale=true` centers genotypes at $2\text{MAF}$ and scales them by $[2\text{MAF}(1-\text{MAF})]^{-1/2}$. Mono-allelic SNPs (MAF=0) are not scaled.

In [30]:
[convert(Vector{Float64}, hapmap[:, 5]) convert(Vector{Float64}, hapmap[:, 5]; center = true, scale = true)]

324x2 Array{Float64,2}:
 0.0  -1.25702 
 0.0  -1.25702 
 1.0   0.167017
 1.0   0.167017
 0.0  -1.25702 
 2.0   1.59106 
 2.0   1.59106 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 1.0   0.167017
 ⋮             
 2.0   1.59106 
 0.0  -1.25702 
 1.0   0.167017
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
 1.0   0.167017
 0.0  -1.25702 
 0.0  -1.25702 
 0.0  -1.25702 
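
The centering and scaling rule amounts to a one-line transformation per genotype. The sketch below uses a hypothetical helper `standardize` and skips the scaling when MAF is 0, as described above.

```julia
# Center at 2*MAF and scale by 1/sqrt(2*MAF*(1-MAF)); a mono-allelic SNP
# (MAF = 0) is left unscaled. `standardize` is a hypothetical helper.
function standardize(x::Real, maf::Real)
    s = 2 * maf * (1 - maf)
    return s > 0 ? (x - 2 * maf) / sqrt(s) : x - 2 * maf
end

[standardize(x, 0.5) for x in 0:2]   # symmetric around 0 when MAF = 0.5
```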

`copy!()` is the in-place version of `convert()`. Applications such as GWAS loop over SNPs and perform a statistical analysis for each SNP. This can be achieved by

In [31]:
g = zeros(size(hapmap, 1))
for j = 1:size(hapmap, 2)
    copy!(g, hapmap[:, j]; model = :additive, impute = true)
    # do statistical analysis
end

## Empirical kinship

`grm()` computes the empirical kinship matrix using either the genetic relationship matrix (`method=:GRM`, default) or the method of moments (`method=:MoM`). Missing genotypes are imputed according to minor allele frequencies on the fly.

In [32]:
# GRM using all SNPs
grm(hapmap)

324x324 Array{Float64,2}:
 0.566363   0.0443143  0.0188416  …  0.0626239  0.06858    0.0625061
 0.0443143  0.530639   0.0308712     0.0492802  0.04351    0.0607644
 0.0188416  0.0308712  0.511348      0.0446786  0.0288571  0.0350833
 0.0463298  0.0356164  0.0269727     0.0574615  0.0630155  0.0572398
 0.0503864  0.0410962  0.0241546     0.0690436  0.0560241  0.0631594
 0.0430385  0.0303637  0.0376147  …  0.0678557  0.0542501  0.0621521
 0.0380203  0.0212397  0.0116626     0.0423601  0.036314   0.0356672
 0.0393772  0.0367071  0.0204582     0.0555189  0.0526184  0.063383 
 0.0287     0.0295828  0.0159914     0.0319977  0.0440216  0.0369149
 0.0379033  0.040584   0.0253424     0.0639459  0.0557325  0.0469195
 0.0462206  0.0442091  0.0220441  …  0.0564275  0.064667   0.0586591
 0.0581122  0.0374812  0.0357274     0.0633427  0.0560261  0.0691391
 0.0342962  0.0421267  0.0250422     0.0547831  0.0673045  0.060752 
 ⋮                                ⋱                                 

In [33]:
# GRM using every other SNP
grm(sub(hapmap, :, 1:2:snps))

324x324 Array{Float64,2}:
 0.556297   0.04153    0.0270458  …  0.0648694  0.0714231  0.0661306
 0.04153    0.545419   0.0358033     0.0557851  0.0444889  0.0541418
 0.0270458  0.0358033  0.501238      0.0381639  0.0378719  0.0454146
 0.0435681  0.044228   0.0258406     0.0497162  0.058592   0.0539135
 0.0502747  0.0465918  0.0251001     0.0649697  0.0547835  0.0632559
 0.0505847  0.0389108  0.0391248  …  0.0749331  0.0593592  0.0510758
 0.0378583  0.025684   0.0157411     0.0453863  0.0331695  0.0333092
 0.0452519  0.0369919  0.0262638     0.0570625  0.0541252  0.0662456
 0.0257782  0.0219979  0.019691      0.0318039  0.041653   0.0345324
 0.0306723  0.0376943  0.0226684     0.0578309  0.0401891  0.0488515
 0.0472728  0.0479606  0.0197466  …  0.0604483  0.0626594  0.0509439
 0.0598418  0.047357   0.0387598     0.064034   0.0623163  0.0619618
 0.024331   0.0415089  0.0215236     0.0523223  0.0660809  0.06096  
 ⋮                                ⋱                                 

In [34]:
# MoM using all SNPs
grm(hapmap; method = :MoM)

324x324 Array{Float64,2}:
 0.539332   0.0352063  0.0028497   …  0.0541007  0.0635479  0.0506761
 0.0352063  0.518312   0.0147768      0.0419374  0.0392214  0.0503218
 0.0028497  0.0147768  0.499417       0.0329626  0.0212717  0.0202089
 0.0428821  0.0290656  0.0237516      0.0522112  0.0688619  0.0490228
 0.0448897  0.033553   0.0154853      0.0653192  0.0574072  0.0562263
 0.0317817  0.0207993  0.0264677   …  0.0594147  0.0486685  0.049259 
 0.0248144  0.0113522  0.00308588     0.0315455  0.030837   0.0244601
 0.0261134  0.0288295  0.0100532      0.038749   0.0461887  0.0463067
 0.0212717  0.0263496  0.0145406      0.0297742  0.0485505  0.0301284
 0.021626   0.0258772  0.0114703      0.053274   0.0503218  0.0320179
 0.0359148  0.0329626  0.00603812  …  0.0498494  0.0559901  0.0499675
 0.0478419  0.0316636  0.0270581      0.0551635  0.0546911  0.0519751
 0.0315455  0.0425279  0.0255229      0.0591785  0.0709875  0.0578795
 ⋮                                 ⋱                            
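
A minimal sketch of the `:GRM` idea on a dense matrix of minor allele counts follows. It assumes missing values have already been imputed, and it assumes the usual standardized cross-product with an extra factor of 1/2 so that self-kinship is near 0.5, as the diagonals above suggest. The exact formula used inside `grm()` is not stated here, and `grm_sketch` is a hypothetical name.

```julia
# Hypothetical sketch, not the package source. X is an n-by-p matrix of minor
# allele counts without missing values; maf holds the p minor allele frequencies.
function grm_sketch(X::Matrix{Float64}, maf::Vector{Float64})
    n, p = size(X)
    Z = zeros(n, p)
    for j in 1:p
        s = 2 * maf[j] * (1 - maf[j])
        s > 0 || continue                 # skip mono-allelic SNPs
        for i in 1:n
            Z[i, j] = (X[i, j] - 2 * maf[j]) / sqrt(s)
        end
    end
    return Z * Z' / (2p)                  # assumed kinship scaling
end

X = [0.0 2.0; 1.0 1.0; 2.0 0.0]
Phi = grm_sketch(X, [0.5, 0.5])
```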

## Principal component analysis 

Principal component analysis is widely used in genome-wide association studies (GWAS) to adjust for population substructure. `pca(A, pcs)` computes the top `pcs` principal components based on SNP data `A`. Missing genotypes are imputed according to minor allele frequencies on the fly. This means that, in the presence of missing genotypes, running the function on the same `SnpArray` twice may produce slightly different answers. Each SNP is centered at $2\text{MAF}$ and scaled by $[2\text{MAF}(1-\text{MAF})]^{-1/2}$. Internally `pca()` converts the `SnpArray` to a matrix of minor allele counts. The default is `Matrix{Float64}`, which can easily exceed memory size. Users can optionally choose a single precision matrix format through the third argument, as in `pca(A, pcs, Matrix{Float32})`. The output is  

* `pcscore`: the `pcs` eigen-SNPs, or principal scores, in each column  
* `pcloading`: the `pcs` eigen-vectors, or principal loadings, in each column  
* `pcvariance`: the `pcs` eigen-values, or principal variances
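
The relation between the three outputs can be illustrated with a singular value decomposition of an already centered and scaled matrix `Z`. This is a sketch under stated assumptions, not the package source. It is written for Julia 1.x (hence `using LinearAlgebra`), `pca_sketch` is a hypothetical name, and dividing by $n-1$ for the variances is one common convention that `pca()` may not share.

```julia
using LinearAlgebra

# Hypothetical sketch: top-k principal components of a centered and scaled
# n-by-p matrix Z through its singular value decomposition Z = U * Diagonal(S) * V'.
function pca_sketch(Z::Matrix{Float64}, k::Int)
    n = size(Z, 1)
    U, S, V = svd(Z)
    pcscore = U[:, 1:k] .* S[1:k]'        # n-by-k principal scores
    pcloading = V[:, 1:k]                 # p-by-k principal loadings
    pcvariance = S[1:k] .^ 2 ./ (n - 1)   # assumed normalization by n - 1
    return pcscore, pcloading, pcvariance
end

Z = [1.0 -1.0 0.5; -1.0 1.0 0.0; 0.5 0.0 -1.0; -0.5 0.0 0.5]
score, loading, variance = pca_sketch(Z, 2)
```

The scores are the projections of the rows of `Z` onto the loadings, so `score` equals `Z * loading` up to rounding.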

In [35]:
pcscore, pcloading, pcvariance = pca(hapmap, 3)

(
324x3 Array{Float64,2}:
 38.7055  -1.30779      7.09544  
 32.6685  -1.14097      3.40319  
 22.8995  -0.558534   -11.1711   
 35.6898  -2.7327       2.20065  
 37.1527  -0.166859     3.56046  
 34.7805  -1.21837     -5.7239   
 21.9871  -5.73129     -2.4039   
 31.0485  -2.29076      0.0934425
 22.7861  -3.75631     -8.13587  
 32.2563  -0.29975     -3.22127  
 36.3146  -0.872441     5.29255  
 35.8666  -0.896326    -0.367194 
 33.95    -3.71151     -7.18578  
  ⋮                              
 49.2409   0.902726   -10.8924   
 46.974   -0.866845     0.542013 
 48.5366  -0.978929     0.099481 
 49.0279   0.271771    -6.00994  
 47.8224  -0.241864     6.98751  
 48.3291  -1.26235      0.629079 
 46.59    -3.53306      4.15501  
 48.9775  -1.66679     -0.138704 
 48.5506   1.40295      1.8158   
 50.1838   0.0780457    2.03419  
 48.9832  -2.07986     -2.16415  
 48.8319   0.327807    -6.52114  ,

13928x3 Array{Float64,2}:
  1.9763e-19    2.46399e-19   1.46868e-19
  ⋮

To use eigen-SNPs for plotting or covariates in GWAS, we typically scale them by their standard deviations so that they have mean zero and unit variance.

In [36]:
scale!(pcscore, 1.0 ./ √(pcvariance))
std(pcscore, 1)
# plot or use in GWAS

1x3 Array{Float64,2}:
 1.00155  1.00155  1.00155

In [37]:
# principal components using every other SNP
pcscore, pcloading, pcvariance = pca(hapmap[:, 1:2:snps], 3)

(
324x3 Array{Float64,2}:
 27.9454  -0.711409     5.26249  
 23.7814  -2.32478      3.1562   
 16.6493   0.327457   -24.7695   
 24.3301  -2.09048     -3.17218  
 25.3713   0.734588     1.95283  
 25.6845  -0.648075    -3.94654  
 16.3993  -4.30347     -0.287887 
 23.0204  -2.01998     -3.72935  
 16.4587  -4.44452     10.7975   
 21.074   -0.0482502   -3.72487  
 25.8971  -1.89193      5.05045  
 26.0288  -1.22232      0.0641994
 23.6892  -2.17757      0.519391 
  ⋮                              
 34.6653   0.154151    -2.33152  
 31.6728  -1.23291      0.44583  
 34.7418  -1.00057     -6.48415  
 35.7385  -0.307271    -5.14196  
 33.7631  -1.1545       3.40345  
 34.2503  -1.49412     -2.53108  
 33.2109  -3.39573      2.64499  
 34.4034  -0.753437    -0.414166 
 34.9499   0.935909    -0.186999 
 35.1142   1.32041     -3.60889  
 33.6966  -1.70929     -0.404096 
 34.2183  -0.192178    -6.35462  ,

6964x3 Array{Float64,2}:
  2.26835e-18   1.06016e-19   7.99034e-19
  ⋮

For large data sets with a majority of rare variants, `pca_sp()` is more efficient: it first converts the `SnpArray` to a sparse matrix (default `SparseMatrixCSC{Float64, Int64}`) and then computes principal components using iterative algorithms.

In [38]:
pcscore, pcloading, pcvariance = pca_sp(hapmap, 3)

(
324x3 Array{Float64,2}:
 -38.7408   1.37515      7.0315    
 -32.6418   1.13487      2.72326   
 -23.0715   0.649067   -15.8401    
 -35.712    2.73164      1.84414   
 -37.1215   0.183687     4.05537   
 -34.8853   1.22878     -6.65263   
 -22.0008   5.75263     -1.86876   
 -31.033    2.24304      0.07411   
 -22.828    3.83526     -7.69284   
 -32.196    0.284564    -3.05915   
 -36.3809   0.834598     5.56534   
 -35.8359   0.909841    -0.6881    
 -33.945    3.77571     -7.15246   
   ⋮                               
 -49.1615  -0.957994    -9.9759    
 -46.9836   0.747927     0.888539  
 -48.5501   1.02731      0.390933  
 -48.9944  -0.292083    -5.34304   
 -47.6961   0.31503      7.0442    
 -48.241    1.3228       0.466566  
 -46.4845   3.53639      4.06458   
 -48.8228   1.69691     -0.00918346
 -48.5829  -1.40163      2.25708   
 -50.2077  -0.0843356    2.06916   
 -48.9875   2.00278     -1.80979   
 -48.8809  -0.353344    -6.19071   ,

13928x3 Array{Float64,2}:
  ⋮

In [39]:
versioninfo()

Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
