In [1]:
versioninfo()

Julia Version 0.7.0
Commit a4cb80f3ed (2018-08-08 06:46 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code


In [2]:
using BEDFiles, BenchmarkTools

## Constructor

### Initialize from existing plink files

In [3]:
const bf = BEDFile(BEDFiles.datadir("mouse.bed"))

1940×10150 BEDFile:
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x02  0x02  0x02  0x02
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x02  0x02  0x02  0x02
 0x03  0x03  0x03  0x03  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
    ⋮

The virtual size of the GWAS data is 1940 observations at each of 10150 SNP positions.

In [4]:
size(bf)

(1940, 10150)

The actual size of the memory-mapped matrix of `UInt8` values is 485 rows and 10150 columns, because each byte holds four 2-bit genotypes (1940 / 4 = 485).

In [5]:
size(bf.data)

(485, 10150)
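The relation between the virtual and the actual size can be sketched in plain Julia. The helper `getgenotype` below is hypothetical (it is not the package's API); it assumes the standard PLINK layout in which each byte packs four genotypes, lowest-order bit pair first.

```julia
# Hypothetical helper, not part of BEDFiles: read the 2-bit genotype of
# individual i at SNP j from a packed byte matrix.
function getgenotype(data::AbstractMatrix{UInt8}, i::Integer, j::Integer)
    q, r = divrem(i - 1, 4)              # q: zero-based byte row, r: slot within the byte
    return (data[q + 1, j] >> (2r)) & 0x03
end

# One column, two bytes. 0xe4 == 0b11100100 holds the codes 0, 1, 2, 3 (low bits first).
packed = reshape(UInt8[0xe4, 0x03], 2, 1)
codes = [getgenotype(packed, i, 1) for i in 1:5]

# Number of bytes needed per column for 1940 individuals
rows_needed = cld(1940, 4)
```

Here `codes` holds the codes for individuals 1 to 5, and `rows_needed` reproduces the 485 rows reported by `size(bf.data)`.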

Because the file is memory-mapped, opening it and accessing the data is fast, even for very large .bed files.

In [6]:
@benchmark(BEDFile(BEDFiles.datadir("mouse.bed")))

BenchmarkTools.Trial: 
  memory estimate:  389.42 KiB
  allocs estimate:  61
  --------------
  minimum time:     104.182 μs (0.00% GC)
  median time:      218.947 μs (0.00% GC)
  mean time:        237.422 μs (11.91% GC)
  maximum time:     46.310 ms (98.77% GC)
  --------------
  samples:          10000
  evals/sample:     1
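Memory-mapping itself can be illustrated with the `Mmap` standard library. This is only a sketch of the general technique on a tiny temporary file, not the package's constructor; it assumes the standard 3-byte PLINK header (`0x6c 0x1b 0x01`) precedes the genotype bytes.

```julia
using Mmap

# Write a tiny stand-in for a .bed file: 3 header bytes, then 2 columns of 2 bytes each.
path, io = mktemp()
write(io, UInt8[0x6c, 0x1b, 0x01])
write(io, UInt8[0xe4, 0x03, 0xff, 0x00])
close(io)

# Map the genotype bytes as a 2×2 matrix without copying them into memory.
# mmap starts at the current stream position, so seek past the header first.
data = open(path, "r") do f
    seek(f, 3)
    Mmap.mmap(f, Matrix{UInt8}, (2, 2))
end
```

Only the pages that are actually touched are read from disk, which is why construction time does not grow with the file size. The temporary file is left in place here because some platforms do not allow deleting a file while it is mapped.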

This file, from a study published in 2006, is about 5 MB in size, but data from recent studies, which have samples from tens of
thousands of individuals at over a million SNP positions, would be in the tens or even hundreds of GB range.
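The size estimate follows from simple arithmetic. The function `bed_bytes` below is a hypothetical name; it assumes four genotypes per byte, each column padded to a whole byte, and a 3-byte file header.

```julia
# Hypothetical helper: approximate size in bytes of a .bed file
# with n individuals and p SNPs.
bed_bytes(n::Integer, p::Integer) = 3 + cld(n, 4) * p

mouse_bytes = bed_bytes(1940, 10150)          # 4922753, i.e. about 5 MB
large_bytes = bed_bytes(50_000, 1_000_000)    # roughly 12.5 GB
```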

By default, the memory-mapped file is read-only, so changing entries is not allowed.

In [7]:
bf[1, 1] = 0x00

ReadOnlyMemoryError: ReadOnlyMemoryError()

To change a bed file, open it with write permission:
```julia
bf = BEDFile(BEDFiles.datadir("mouse.bed"), "w")
```

### Initialize and create bed file

Initialize 5 rows and 3 columns with all A1 alleles (0x00) and memory-map to a bed file

In [8]:
tmpbf = BEDFile("tmp.bed", 5, 3)

5×3 BEDFile:
 0x00  0x00  0x00
 0x00  0x00  0x00
 0x00  0x00  0x00
 0x00  0x00  0x00
 0x00  0x00  0x00

In [9]:
tmpbf[1:2, 1:2] .= 0x03
tmpbf

5×3 BEDFile:
 0x03  0x03  0x00
 0x03  0x03  0x00
 0x00  0x00  0x00
 0x00  0x00  0x00
 0x00  0x00  0x00

In [10]:
fill!(tmpbf, 0x02)
tmpbf

5×3 BEDFile:
 0x02  0x02  0x02
 0x02  0x02  0x02
 0x02  0x02  0x02
 0x02  0x02  0x02
 0x02  0x02  0x02

In [11]:
rm("tmp.bed")

Initialize 5 rows and 3 columns with undefined genotypes without memory-mapping to any file

In [12]:
tmpbf = BEDFile(undef, 5, 3)

5×3 BEDFile:
 0x02  0x00  0x03
 0x00  0x00  0x00
 0x00  0x00  0x00
 0x00  0x00  0x00
 0x00  0x00  0x00

Create a bed file corresponding to an existing BEDFile and memory-map it.

In [13]:
tmpbf = BEDFile("tmp.bed", tmpbf)

5×3 BEDFile:
 0x02  0x00  0x03
 0x00  0x00  0x00
 0x00  0x00  0x00
 0x00  0x00  0x00
 0x00  0x00  0x00

In [14]:
tmpbf[1, 1] = 0x02
tmpbf

5×3 BEDFile:
 0x02  0x00  0x03
 0x00  0x00  0x00
 0x00  0x00  0x00
 0x00  0x00  0x00
 0x00  0x00  0x00

In [15]:
rm("tmp.bed")

## Convert to `Float`s

The most common usage is to convert genotypes to numeric values for statistical analysis. The conversion rule depends on the genetic model (additive, dominant, or recessive) and on whether centering, scaling, or imputation is requested.
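As a rough guide to the three models, the mapping from a 2-bit code to a number can be sketched as follows. The function `genotype_value` is hypothetical and is written to agree with the outputs shown in this section (missing becomes `NaN` when no imputation is requested); it is not the package's implementation.

```julia
# Hypothetical helper: numeric value of one BED code under a genetic model.
# Codes: 0x00 = homozygous A1, 0x01 = missing, 0x02 = heterozygous, 0x03 = homozygous A2.
function genotype_value(g::UInt8, model::Symbol = :additive)
    g == 0x01 && return NaN                           # missing genotype
    if model === :additive
        return g == 0x00 ? 0.0 : Float64(g - 0x01)    # number of A2 alleles: 0, 1, 2
    elseif model === :dominant
        return g == 0x00 ? 0.0 : 1.0                  # at least one A2 allele
    elseif model === :recessive
        return g == 0x03 ? 1.0 : 0.0                  # two A2 alleles
    else
        throw(ArgumentError("unknown model: $model"))
    end
end

additive  = genotype_value.(UInt8[0x00, 0x02, 0x03])
dominant  = genotype_value.(UInt8[0x00, 0x02, 0x03], :dominant)
recessive = genotype_value.(UInt8[0x00, 0x02, 0x03], :recessive)
```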

### `copyto!`

Copy a column to a Float64 vector using defaults (`model=:additive`, `center=false`, `scale=false`, `impute=false`)

In [16]:
v = zeros(size(bf, 1))
copyto!(v, bf, 1)

1940-element Array{Float64,1}:
 1.0
 1.0
 2.0
 1.0
 2.0
 1.0
 1.0
 1.0
 1.0
 2.0
 2.0
 1.0
 2.0
 ⋮  
 2.0
 2.0
 1.0
 1.0
 2.0
 1.0
 2.0
 1.0
 1.0
 1.0
 1.0
 0.0

In [17]:
@benchmark(copyto!($v, $bf, 1))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.297 μs (0.00% GC)
  median time:      6.612 μs (0.00% GC)
  mean time:        6.782 μs (0.00% GC)
  maximum time:     17.564 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

Copy columns using defaults

In [18]:
v2 = zeros(size(bf, 1), 2)
copyto!(v2, bf, 1:2)

1940×2 Array{Float64,2}:
 1.0  1.0
 1.0  1.0
 2.0  2.0
 1.0  1.0
 2.0  2.0
 1.0  1.0
 1.0  1.0
 1.0  1.0
 1.0  1.0
 2.0  2.0
 2.0  2.0
 1.0  1.0
 2.0  2.0
 ⋮       
 2.0  2.0
 2.0  2.0
 1.0  1.0
 1.0  1.0
 2.0  2.0
 1.0  1.0
 2.0  2.0
 1.0  1.0
 1.0  1.0
 1.0  1.0
 1.0  1.0
 0.0  0.0

In [19]:
@benchmark(copyto!($v2, $bf, 1:2))

BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  2
  --------------
  minimum time:     9.164 μs (0.00% GC)
  median time:      9.469 μs (0.00% GC)
  mean time:        9.907 μs (0.00% GC)
  maximum time:     54.309 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Convert according to dominant model

In [20]:
copyto!(v, bf, 1, model = :dominant)

1940-element Array{Float64,1}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮  
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 0.0

In [21]:
@benchmark(copyto!($v, $bf, 1, model=:dominant))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.565 μs (0.00% GC)
  median time:      3.736 μs (0.00% GC)
  mean time:        4.152 μs (0.00% GC)
  maximum time:     20.106 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

Convert according to recessive model

In [22]:
copyto!(v, bf, 1, model=:recessive)

1940-element Array{Float64,1}:
 0.0
 0.0
 1.0
 0.0
 1.0
 0.0
 0.0
 0.0
 0.0
 1.0
 1.0
 0.0
 1.0
 ⋮  
 1.0
 1.0
 0.0
 0.0
 1.0
 0.0
 1.0
 0.0
 0.0
 0.0
 0.0
 0.0

In [23]:
@benchmark(copyto!($v, $bf, 1, model=:recessive))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.770 μs (0.00% GC)
  median time:      4.150 μs (0.00% GC)
  mean time:        4.164 μs (0.00% GC)
  maximum time:     30.898 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

Center and scale

In [24]:
copyto!(v, bf, 1, center=true, scale=true)

1940-element Array{Float64,1}:
 -0.16084075452851265
 -0.16084075452851265
  1.2624897581484626 
 -0.16084075452851265
  1.2624897581484626 
 -0.16084075452851265
 -0.16084075452851265
 -0.16084075452851265
 -0.16084075452851265
  1.2624897581484626 
  1.2624897581484626 
 -0.16084075452851265
  1.2624897581484626 
  ⋮                  
  1.2624897581484626 
  1.2624897581484626 
 -0.16084075452851265
 -0.16084075452851265
  1.2624897581484626 
 -0.16084075452851265
  1.2624897581484626 
 -0.16084075452851265
 -0.16084075452851265
 -0.16084075452851265
 -0.16084075452851265
 -1.584171267205488  

In [25]:
@benchmark(copyto!($v, $bf, 1, center=true, scale=true))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.656 μs (0.00% GC)
  median time:      7.843 μs (0.00% GC)
  mean time:        8.385 μs (0.00% GC)
  maximum time:     67.898 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

In [26]:
copyto!(v, bf, 1, center=true, scale=true)
@show count(isnan, v)
@show mean(v);

count(isnan, v) = 2
mean(v) = NaN


Impute by mean value

In [27]:
copyto!(v, bf, 1, center=true, scale=true, impute=true)
@show count(isnan, v)
@show mean(v);

count(isnan, v) = 0
mean(v) = -1.648166139649717e-17
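The centered and scaled values shown above are consistent with centering at the mean A2 allele count μ of the observed genotypes and dividing by sqrt(μ(1 - μ/2)), the binomial standard deviation sqrt(2p(1 - p)) with p = μ/2. The sketch below implements that rule in plain Julia; `dosage` and `standardize` are hypothetical names, and the formula is inferred from the printed output rather than taken from the package source.

```julia
# Hypothetical helper: additive dosage of a non-missing code.
dosage(c::UInt8) = c == 0x00 ? 0.0 : Float64(c - 0x01)

# Hypothetical sketch: center, scale, and optionally mean-impute one SNP column.
function standardize(codes::AbstractVector{UInt8}; impute::Bool = true)
    observed = [dosage(c) for c in codes if c != 0x01]
    μ = sum(observed) / length(observed)      # mean A2 allele count
    σ = sqrt(μ * (1 - μ / 2))                 # sqrt(2p(1 - p)) with p = μ / 2
    fill_value = impute ? 0.0 : NaN           # the imputed mean is 0 after centering
    return [c == 0x01 ? fill_value : (dosage(c) - μ) / σ for c in codes]
end

z = standardize(UInt8[0x00, 0x02, 0x03, 0x01, 0x02])
```

Applied to a column with the genotype counts of the first SNP (358, 2, 1003, and 577 for the codes 0x00 to 0x03), this rule gives approximately -1.5842, -0.1608, and 1.2625 for the three observed genotypes. A monomorphic SNP has σ = 0 and would need separate handling.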


In [28]:
@benchmark(copyto!($v, $bf, 1, center=true, scale=true, impute=true))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.434 μs (0.00% GC)
  median time:      7.663 μs (0.00% GC)
  mean time:        8.038 μs (0.00% GC)
  maximum time:     28.615 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

In [29]:
copyto!(v2, bf, 1:2, center=true, scale=true, impute=true)

1940×2 Array{Float64,2}:
 -0.160841  -0.15993
 -0.160841  -0.15993
  1.26249    1.2633 
 -0.160841  -0.15993
  1.26249    1.2633 
 -0.160841  -0.15993
 -0.160841  -0.15993
 -0.160841  -0.15993
 -0.160841  -0.15993
  1.26249    1.2633 
  1.26249    1.2633 
 -0.160841  -0.15993
  1.26249    1.2633 
  ⋮                 
  1.26249    1.2633 
  1.26249    1.2633 
 -0.160841  -0.15993
 -0.160841  -0.15993
  1.26249    1.2633 
 -0.160841  -0.15993
  1.26249    1.2633 
 -0.160841  -0.15993
 -0.160841  -0.15993
 -0.160841  -0.15993
 -0.160841  -0.15993
 -1.58417   -1.58316

In [30]:
mean(v2, dims=1)

1×2 Array{Float64,2}:
 -1.64817e-17  1.09878e-17

Looping over all columns

In [31]:
@benchmark for j in 1:$size($bf, 2)
    copyto!($v, $bf, j)
end

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     49.829 ms (0.00% GC)
  median time:      54.384 ms (0.00% GC)
  mean time:        53.947 ms (0.00% GC)
  maximum time:     59.878 ms (0.00% GC)
  --------------
  samples:          93
  evals/sample:     1

### `convert`

`convert` allocates and returns a new array; `copyto!` is its in-place counterpart.

In [32]:
[convert(Vector{Float64}, bf, 1) convert(Vector{Float64}, bf, 1, model=:dominant) convert(Vector{Float64}, bf, 1, model=:recessive)]

1940×3 Array{Float64,2}:
 1.0  1.0  0.0
 1.0  1.0  0.0
 2.0  1.0  1.0
 1.0  1.0  0.0
 2.0  1.0  1.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 2.0  1.0  1.0
 2.0  1.0  1.0
 1.0  1.0  0.0
 2.0  1.0  1.0
 ⋮            
 2.0  1.0  1.0
 2.0  1.0  1.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 2.0  1.0  1.0
 1.0  1.0  0.0
 2.0  1.0  1.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 1.0  1.0  0.0
 0.0  0.0  0.0

In [33]:
[convert(Vector{Float64}, bf, 1) convert(Vector{Float64}, bf, 1, impute=true) convert(Vector{Float64}, bf, 1, center=true, scale=true)]

1940×3 Array{Float64,2}:
 1.0  1.0  -0.160841
 1.0  1.0  -0.160841
 2.0  2.0   1.26249 
 1.0  1.0  -0.160841
 2.0  2.0   1.26249 
 1.0  1.0  -0.160841
 1.0  1.0  -0.160841
 1.0  1.0  -0.160841
 1.0  1.0  -0.160841
 2.0  2.0   1.26249 
 2.0  2.0   1.26249 
 1.0  1.0  -0.160841
 2.0  2.0   1.26249 
 ⋮                  
 2.0  2.0   1.26249 
 2.0  2.0   1.26249 
 1.0  1.0  -0.160841
 1.0  1.0  -0.160841
 2.0  2.0   1.26249 
 1.0  1.0  -0.160841
 2.0  2.0   1.26249 
 1.0  1.0  -0.160841
 1.0  1.0  -0.160841
 1.0  1.0  -0.160841
 1.0  1.0  -0.160841
 0.0  0.0  -1.58417 

In [34]:
convert(Matrix{Float16}, bf, 1:2:10)

1940×5 Array{Float16,2}:
 1.0  1.0  2.0  2.0  1.0
 1.0  2.0  1.0  1.0  1.0
 2.0  2.0  2.0  2.0  2.0
 1.0  1.0  1.0  1.0  1.0
 2.0  2.0  2.0  2.0  2.0
 1.0  1.0  2.0  2.0  1.0
 1.0  1.0  2.0  2.0  1.0
 1.0  2.0  1.0  1.0  1.0
 1.0  2.0  1.0  1.0  1.0
 2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0
 1.0  1.0  2.0  2.0  1.0
 2.0  2.0  2.0  2.0  2.0
 ⋮                      
 2.0  2.0  2.0  2.0  2.0
 2.0  2.0  2.0  2.0  2.0
 1.0  1.0  1.0  1.0  1.0
 1.0  1.0  2.0  2.0  1.0
 2.0  2.0  2.0  2.0  2.0
 1.0  1.0  2.0  2.0  1.0
 2.0  2.0  2.0  2.0  2.0
 1.0  1.0  2.0  2.0  1.0
 1.0  2.0  1.0  1.0  1.0
 1.0  2.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0
 0.0  0.0  2.0  2.0  0.0

In [35]:
convert(Matrix{Float16}, bf)

1940×10150 Array{Float16,2}:
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0  …    2.0    2.0    2.0    2.0    2.0
 1.0  1.0  2.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       1.0    1.0    1.0    1.0    1.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0  …    2.0    2.0    2.0    2.0    2.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  2.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 1.0  1.0  2.0  1.0  1.0  1.0  1.0  2.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0       1.0    1.0    1.0    1.0    1.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …    0.0    0.0    0.0    0.0    0.0
 1.0  1.0  1.0  1.0  2.0  1.0  2.0  1.0       2.0    2.0    2.0    2.0    2.0
 2.0  2.0  2.0  2.0  2.0  2.0  2.0 

In [36]:
@benchmark(convert(Matrix{Float64}, $bf))

BenchmarkTools.Trial: 
  memory estimate:  150.84 MiB
  allocs estimate:  19791
  --------------
  minimum time:     119.476 ms (5.85% GC)
  median time:      121.227 ms (5.81% GC)
  mean time:        126.685 ms (8.46% GC)
  maximum time:     176.689 ms (29.62% GC)
  --------------
  samples:          40
  evals/sample:     1

## Raw summaries

### Counts

Counts of each of the four possible values for each column are returned by `counts`.

In [37]:
counts(bf, dims=1)

4×10150 Array{Int64,2}:
  358   359  252   358    33   359  …    56    56    56    56    56    56
    2     0    4     3     4     1      173   173   162   173   174   175
 1003  1004  888  1004   442  1004      242   242   242   242   242   242
  577   577  796   575  1461   576     1469  1469  1480  1469  1468  1467

Column 2 has no missing values (code `0x01`, the second row in the column-counts table).
At that SNP position in this sample, 359 individuals are homozygous for allele 1 (`G` according to the `.bim` file), 1004 are heterozygous,
and 577 are homozygous for allele 2 (`A`).

The counts by column and by row are cached in the `BEDFile` object.
Accesses after the first are extremely fast.
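What a single column of the counts table contains can be sketched with a short tally loop. The function `count_codes` is a hypothetical name and assumes every code lies in `0x00:0x03`.

```julia
# Hypothetical helper: tally the four BED codes in one column.
# Entry k of the result counts the code k - 1.
function count_codes(codes::AbstractVector{UInt8})
    tally = zeros(Int, 4)
    for c in codes
        tally[c + 1] += 1
    end
    return tally
end

tally = count_codes(UInt8[0x00, 0x02, 0x02, 0x03, 0x01, 0x02])
```

Computing such a table requires one pass over the data, which is why caching it pays off for repeated summaries.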

In [38]:
@benchmark(counts($bf, dims=1))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.669 ns (0.00% GC)
  median time:      9.437 ns (0.00% GC)
  mean time:        9.742 ns (0.00% GC)
  maximum time:     35.384 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

### Minor allele frequencies

Minor allele frequencies (MAF)

In [39]:
maf(bf)

10150-element Array{Float64,1}:
 0.4434984520123839  
 0.4438144329896907  
 0.359504132231405   
 0.4439855446566856  
 0.13119834710743805 
 0.44404332129963897 
 0.1412706611570248  
 0.30299123259412064 
 0.4445018069179143  
 0.44424367578729995 
 0.43427835051546393 
 0.14075413223140498 
 0.304639175257732   
 ⋮                   
 0.0527624309392265  
 0.052980132450331174
 0.08079096045197742 
 0.08253250423968339 
 0.08253250423968339 
 0.10022650056625138 
 0.10016977928692694 
 0.10016977928692694 
 0.09955005624296964 
 0.10016977928692694 
 0.10022650056625138 
 0.10028328611898019 

Minor allele (`false` means A1 is the minor allele; `true` means A2 is the minor allele)

In [40]:
minorallele(bf)

10150-element Array{Bool,1}:
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
     ⋮
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
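Both quantities can be derived from a column of the counts table. The sketch below uses a hypothetical function `maf_from_counts`; the tie rule (A2 reported as minor only when its frequency is strictly below 0.5) is an assumption.

```julia
# Hypothetical helper: minor allele frequency from the four code counts
# (homozygous A1, missing, heterozygous, homozygous A2).
function maf_from_counts(c::AbstractVector{<:Integer})
    n_obs = c[1] + c[3] + c[4]                # non-missing genotypes
    p2 = (c[3] + 2c[4]) / (2n_obs)            # frequency of allele A2
    return (maf = min(p2, 1 - p2), a2_is_minor = p2 < 0.5)
end

# Counts of the first SNP, taken from the counts table above
first_snp = maf_from_counts([358, 2, 1003, 577])
```

For the first SNP this gives a frequency of about 0.4435 with A1 as the minor allele, in agreement with the first entries of `maf(bf)` and `minorallele(bf)`.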


## Instantiating as a count of the second allele

In some operations on GWAS data the data are converted to counts of the second allele, according to

|BEDFile|count   |
|------:|--------:|
| 0x00  | 0       |
| 0x01  | missing |
| 0x02  | 1       |
| 0x03  | 2       |

This can be accomplished by indexing `bedvals` with the `BEDFile` or with a view of the `BEDFile`,
producing an array of type `Union{Missing,Int8}`, which is the preferred way in v0.7 of
representing arrays that may contain missing values.

In [41]:
bedvals[bf]

1940×10150 Array{Union{Missing, Int8},2}:
 1  1  1  1  2  1  2  1  1  1  1  2  1  1  2  …  2         2         2       
 1  1  2  1  1  1  1  2  1  1  1  1  2  2  1     2         2         2       
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2     2         2         2       
 1  1  1  1  1  1  1  2  1  1  1  1  2  2  1     2         2         2       
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2     1         1         1       
 1  1  1  1  2  1  2  1  1  1  1  2  1  1  2  …  2         2         2       
 1  1  1  1  2  1  2  1  1  1  1  2  1  1  2     2         2         2       
 1  1  2  1  1  1  1  2  1  1  1  1  2  2  1     2         2         2       
 1  1  2  1  1  1  1  2  1  1  1  1  2  2  1     2         2         2       
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2     1         1         1       
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  …  0         0         0       
 1  1  1  1  2  1  2  1  1  1  1  2  1  1  2     2         2         2       
 2  2  2  2  2  2  2  

In [42]:
sort(unique(bedvals[bf]))

4-element Array{Union{Missing, Int8},1}:
 0       
 1       
 2       
  missing
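The same mapping can be written as a small lookup table in plain Julia. The names `lookup` and `decode` are hypothetical, and the table is shifted by one because ordinary Julia arrays are 1-based.

```julia
# Hypothetical lookup table: entry k holds the A2 allele count for code k - 1.
const lookup = Union{Missing,Int8}[0, missing, 1, 2]

# Decode an array of BED codes by indexing the table.
decode(codes::AbstractArray{UInt8}) = [lookup[c + 1] for c in codes]

decoded = decode(UInt8[0x00, 0x01, 0x02, 0x03])
```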

## Summary statistics

The package provides methods for the generics `mean` and `var` from the `Statistics` package.

In [43]:
mean(bf, dims=1)

1×10150 Array{Float64,2}:
 1.113  1.11237  1.28099  1.11203  …  1.8009  1.79966  1.79955  1.79943

In [44]:
var(bf, dims=1)

1×10150 Array{Float64,2}:
 0.469929  0.470089  0.462605  0.469365  …  0.223714  0.223818  0.223923

These methods make use of the cached column or row counts and thus are very fast.

In [45]:
@benchmark(mean($bf, dims=1))

BenchmarkTools.Trial: 
  memory estimate:  79.39 KiB
  allocs estimate:  2
  --------------
  minimum time:     12.448 μs (0.00% GC)
  median time:      16.292 μs (0.00% GC)
  mean time:        19.996 μs (17.21% GC)
  maximum time:     1.696 ms (98.50% GC)
  --------------
  samples:          10000
  evals/sample:     1
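The reason cached counts suffice is that the mean and sample variance of the A2 allele count depend only on how often each code occurs. A minimal sketch, with the hypothetical name `mean_var_from_counts`:

```julia
# Hypothetical helper: mean and sample variance of the A2 allele count,
# computed from the four code counts and ignoring missing genotypes.
function mean_var_from_counts(c::AbstractVector{<:Integer})
    n  = c[1] + c[3] + c[4]      # non-missing genotypes
    s1 = c[3] + 2c[4]            # sum of the allele counts
    s2 = c[3] + 4c[4]            # sum of the squared allele counts
    μ = s1 / n
    v = (s2 - n * μ^2) / (n - 1)
    return μ, v
end

# Counts of the first SNP, taken from the counts table above
μ1, v1 = mean_var_from_counts([358, 2, 1003, 577])
```

For the first SNP this gives approximately 1.113 and 0.46993, matching the first entries of `mean(bf, dims=1)` and `var(bf, dims=1)`.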

The column-wise or row-wise standard deviations are returned by `std`.

In [46]:
std(bf, dims=2)

1940×1 Array{Float64,2}:
 0.6504997290784408
 0.6379008244533891
 0.6558172726141286
 0.6532675479248437
 0.6744432174014563
 0.6519092298111158
 0.6779881845456428
 0.6955814098050999
 0.6437566832989493
 0.6505283141088536
 0.665444994623426 
 0.659392039592328 
 0.6641674726999468
 ⋮                 
 0.6599158250006595
 0.688387450736178 
 0.6664063015924304
 0.6613451651895259
 0.6659810347614777
 0.6274577846909379
 0.6823658517777204
 0.6695299551061924
 0.710756592739754 
 0.6387913736114869
 0.6736492722732016
 0.688855476425891 

## Location of the missing values


The positions of the missing data are evaluated by

In [47]:
mp = missingpos(bf)

1940×10150 SparseArrays.SparseMatrixCSC{Bool,Int32} with 33922 stored entries:
  [702  ,     1]  =  true
  [949  ,     1]  =  true
  [914  ,     3]  =  true
  [949  ,     3]  =  true
  [1604 ,     3]  =  true
  [1891 ,     3]  =  true
  [81   ,     4]  =  true
  [990  ,     4]  =  true
  [1882 ,     4]  =  true
  [81   ,     5]  =  true
  [676  ,     5]  =  true
  [990  ,     5]  =  true
  ⋮
  [1791 , 10150]  =  true
  [1795 , 10150]  =  true
  [1846 , 10150]  =  true
  [1848 , 10150]  =  true
  [1851 , 10150]  =  true
  [1853 , 10150]  =  true
  [1860 , 10150]  =  true
  [1873 , 10150]  =  true
  [1886 , 10150]  =  true
  [1894 , 10150]  =  true
  [1897 , 10150]  =  true
  [1939 , 10150]  =  true

In [48]:
@benchmark(missingpos($bf))

BenchmarkTools.Trial: 
  memory estimate:  1.81 MiB
  allocs estimate:  19273
  --------------
  minimum time:     34.268 ms (0.00% GC)
  median time:      38.049 ms (0.00% GC)
  mean time:        38.179 ms (0.39% GC)
  maximum time:     43.449 ms (0.00% GC)
  --------------
  samples:          131
  evals/sample:     1

So, for example, the number of missing data values in each column can be evaluated as

In [49]:
sum(mp, dims=1)

1×10150 Array{Int64,2}:
 2  0  4  3  4  1  4  1  3  3  0  4  0  …  174  173  173  162  173  174  175

although it is faster, but somewhat more obscure, to use

In [50]:
view(counts(bf, dims=1), 2:2, :)

1×10150 view(::Array{Int64,2}, 2:2, :) with eltype Int64:
 2  0  4  3  4  1  4  1  3  3  0  4  0  …  174  173  173  162  173  174  175

### Missing rate

Proportion of missing genotypes

In [51]:
missingrate(bf, 1)

10150-element Array{Float64,1}:
 0.0010309278350515464
 0.0                  
 0.002061855670103093 
 0.0015463917525773195
 0.002061855670103093 
 0.0005154639175257732
 0.002061855670103093 
 0.0005154639175257732
 0.0015463917525773195
 0.0015463917525773195
 0.0                  
 0.002061855670103093 
 0.0                  
 ⋮                    
 0.06701030927835051  
 0.06597938144329897  
 0.08762886597938144  
 0.08814432989690722  
 0.08814432989690722  
 0.08969072164948454  
 0.08917525773195877  
 0.08917525773195877  
 0.08350515463917525  
 0.08917525773195877  
 0.08969072164948454  
 0.09020618556701031  

In [52]:
missingrate(bf, 2)

1940-element Array{Float64,1}:
 0.00019704433497536947
 0.0                   
 0.018423645320197045  
 0.0007881773399014779 
 0.0                   
 0.004236453201970443  
 0.0051231527093596055 
 0.00039408866995073894
 0.005517241379310344  
 0.0016748768472906405 
 0.0                   
 9.852216748768474e-5  
 0.0004926108374384236 
 ⋮                     
 0.000689655172413793  
 0.004729064039408867  
 0.0004926108374384236 
 0.001083743842364532  
 0.00019704433497536947
 0.0025615763546798028 
 0.0038423645320197044 
 0.001379310344827586  
 0.0064039408866995075 
 0.002857142857142857  
 0.0011822660098522167 
 0.00029556650246305416
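The missing rate along either dimension is the share of `0x01` codes. A plain-Julia sketch on a small matrix of codes, using the hypothetical function `missing_rates` with the same meaning of the second argument as above (1 for per-column rates, 2 for per-row rates):

```julia
# Hypothetical helper: proportion of missing codes (0x01) per column (dim = 1)
# or per row (dim = 2) of a matrix of BED codes.
function missing_rates(codes::AbstractMatrix{UInt8}, dim::Integer)
    ismiss = codes .== 0x01
    return vec(sum(ismiss, dims = dim)) ./ size(codes, dim)
end

codes = UInt8[0x00 0x01;
              0x01 0x01;
              0x02 0x03;
              0x03 0x02]
by_column = missing_rates(codes, 1)
by_row    = missing_rates(codes, 2)
```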

## Genetic relation matrix

Classical genetic relation matrix

In [53]:
grm(bf, method=:GRM)

1940×1940 Array{Float64,2}:
  0.478301    -0.0331304    0.0135612    …  -0.0347737   -0.0129443 
 -0.0331304    0.422771    -0.0389227        0.0457987    0.00556832
  0.0135612   -0.0389227    0.509248        -0.0356689   -0.0608705 
  0.0198205    0.00728645  -0.00935362      -0.0302404   -0.0102152 
  0.056747    -0.0163418   -0.00495283      -0.0413347   -0.0415659 
 -0.0165628   -0.0191127   -0.0112181    …   0.0177118   -0.0193087 
  0.123771    -0.0404167    0.00442739       0.00880649  -0.0437565 
 -0.0628362    0.172552    -0.0728312        0.0640027   -0.0281429 
  0.0605018   -0.0260505    0.00398852      -0.00277754  -0.0607773 
  0.108886    -0.0204594   -0.00767711      -0.0210501    0.00343526
 -0.0142307    0.00270989  -0.0235504    …  -0.0223563   -0.028408  
 -0.0306022    0.197743    -0.00244269       0.0213998   -0.0478472 
 -0.0131463   -0.0226707    0.0223522       -0.037288     0.0493662 
  ⋮                                      ⋱                          
  0.01
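For orientation, the textbook form of the classical GRM standardizes each SNP column by its allele frequency and averages the outer products over SNPs. The sketch below, with the hypothetical name `classical_grm`, assumes a complete matrix of A2 allele counts with no missing values and no monomorphic SNPs; how the package treats missing genotypes is not shown here.

```julia
# Hypothetical sketch of the textbook classical GRM for complete data.
# X is an n × m matrix of A2 allele counts (0, 1, or 2).
function classical_grm(X::AbstractMatrix{<:Real})
    n, m = size(X)
    p = vec(sum(X, dims = 1)) ./ (2n)                      # A2 allele frequency per SNP
    Z = (X .- 2 .* p') ./ sqrt.(2 .* p' .* (1 .- p'))      # centered and scaled columns
    return (Z * Z') ./ m
end

X = Float64[0 1; 1 2; 2 1; 1 0]
G = classical_grm(X)
```

The result is a symmetric n × n matrix whose entries sum to zero, because every standardized column sums to zero.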

Method-of-moments estimate of the GRM (Day-Williams)

In [54]:
grm(bf, method=:MoM)

1940×1940 Array{Float64,2}:
  0.476924    -0.0108051   -0.00387107  …  -0.0132575   -0.0186653 
 -0.0108051    0.470286    -0.0372729       0.093392     0.0195127 
 -0.00387107  -0.0372729    0.455097       -0.0342803   -0.0930451 
 -0.00446372   0.00530804  -0.0545261      -0.026341    -0.0410591 
  0.0424276   -0.0154829   -0.0473783      -0.0425746   -0.064802  
 -0.0168192   -0.00230175  -0.0291755   …   0.0417797   -0.0208387 
  0.100186    -0.0381512   -0.0448414       0.0191225   -0.0663409 
 -0.0540423    0.204933    -0.0883685       0.0876198   -0.0275354 
  0.0426121   -0.0248726   -0.0293912       0.00252887  -0.0912029 
  0.0679449   -0.0333021   -0.0607221      -0.0317169   -0.0428171 
 -0.014776     0.0207843   -0.0437361   …  -0.00785222  -0.0374659 
 -0.0213453    0.239663    -0.0174233       0.0507615   -0.046752  
 -0.0175153   -0.0047731   -0.00615514     -0.0147855    0.034085  
  ⋮                                     ⋱                          
  0.0364703    0.026

Robust GRM by Thompson

In [55]:
grm(bf, method=:Robust)

1940×1940 Array{Float64,2}:
  0.473365    -0.0320711     0.0155127   …  -0.0378746    -0.0129801 
 -0.0320711    0.431313     -0.0355964       0.0510677     0.00749065
  0.0155127   -0.0355964     0.497424       -0.0359548    -0.0644173 
  0.0126728    0.00473735   -0.014447       -0.0302627    -0.0146786 
  0.0625811   -0.0130366    -0.00428215     -0.0434793    -0.0354045 
 -0.0224134   -0.0256032    -0.0118271   …   0.0151273    -0.0171889 
  0.116858    -0.0391866    -0.00522697      0.0147361    -0.0404251 
 -0.061218     0.18005      -0.0726015       0.0593859    -0.0254671 
  0.0616498   -0.0235422     0.0125891       0.00050831   -0.0629212 
  0.100637    -0.0183175    -0.00508762     -0.0200833    -0.00088128
 -0.0196773   -0.00182424   -0.0256948   …  -0.0338118    -0.0331233 
 -0.0329798    0.210321     -0.00611522      0.0180688    -0.0491425 
 -0.018995    -0.0239601     0.0153077      -0.0373235     0.0418492 
  ⋮                                      ⋱                    

In [56]:
@benchmark(grm($bf, method=:GRM))

BenchmarkTools.Trial: 
  memory estimate:  29.26 MiB
  allocs estimate:  10168
  --------------
  minimum time:     473.042 ms (0.15% GC)
  median time:      546.560 ms (0.37% GC)
  mean time:        548.844 ms (0.33% GC)
  maximum time:     634.939 ms (0.35% GC)
  --------------
  samples:          10
  evals/sample:     1

In [57]:
@benchmark(grm($bf, method=:MoM))

BenchmarkTools.Trial: 
  memory estimate:  29.26 MiB
  allocs estimate:  10168
  --------------
  minimum time:     459.410 ms (0.40% GC)
  median time:      486.963 ms (0.40% GC)
  mean time:        489.846 ms (0.35% GC)
  maximum time:     526.898 ms (0.37% GC)
  --------------
  samples:          11
  evals/sample:     1

In [58]:
@benchmark(grm($bf, method=:Robust))

BenchmarkTools.Trial: 
  memory estimate:  29.26 MiB
  allocs estimate:  10168
  --------------
  minimum time:     466.357 ms (0.42% GC)
  median time:      494.470 ms (0.38% GC)
  mean time:        491.460 ms (0.35% GC)
  maximum time:     511.205 ms (0.15% GC)
  --------------
  samples:          11
  evals/sample:     1

## Filtering

Count number of rows and columns that have proportion of missingness < 0.01.

In [59]:
@show rowmr = count(missingrate(bf, 2) .< 0.01)
@show colmr = count(missingrate(bf, 1) .< 0.01);

rowmr = count(missingrate(bf, 2) .< 0.01) = 1907
colmr = count(missingrate(bf, 1) .< 0.01) = 9997


Filter according to minimum success rates (1 - proportion of missing genotypes) per row and column

In [60]:
rowmask, colmask = BEDFiles.filter(bf, 0.99, 0.99)

(Bool[true, true, false, true, true, true, true, true, true, true  …  true, true, true, true, true, true, true, true, true, true], Bool[true, true, true, true, true, true, true, true, true, true  …  false, false, false, false, false, false, false, false, false, false])

In [61]:
count(rowmask), count(colmask)

(1907, 9997)
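The idea behind the two masks can be sketched as a single pass over a matrix of missingness indicators. The function `success_masks` is a hypothetical name; the package's own filter may iterate or treat the threshold boundary differently, so this is only an illustration of the criterion.

```julia
# Hypothetical sketch: keep rows and columns whose success rate
# (1 - proportion missing) is at least the given minimum.
function success_masks(ismiss::AbstractMatrix{Bool}, min_row::Real, min_col::Real)
    rowmask = vec(1 .- sum(ismiss, dims = 2) ./ size(ismiss, 2)) .>= min_row
    colmask = vec(1 .- sum(ismiss, dims = 1) ./ size(ismiss, 1)) .>= min_col
    return rowmask, colmask
end

ismiss = Bool[0 0 0;
              1 1 0;
              0 0 0;
              0 1 0]
rows_kept, cols_kept = success_masks(ismiss, 0.6, 0.7)
```

Removing rows changes the column rates and vice versa, so a single pass does not guarantee that both thresholds hold on the filtered submatrix.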

In [62]:
bf[rowmask, colmask]

1907×9997 Array{UInt8,2}:
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x02  0x03  0x02  0x02  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x00  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02     0x03  0x03  0x03  0x00  0x00  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x03  0x02  0x02  0x02  0x03
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x00  0x00  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x03  0x03  0x00  0x03  0x00  0x00
 0x02  0x02  0x03  0x02  0x02  0x02     0x00  0x03  0x03  0x00  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x00  0x03  0x00  0x00  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x03  0x02  0x02  0x02  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x00  0x03  0x00  0x03
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x03  0x03  0x03  0x00  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x00  0x00  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x00  0x03  0x00  0x03  0x03

In [63]:
@benchmark(BEDFiles.filter($bf, 0.99, 0.99))

BenchmarkTools.Trial: 
  memory estimate:  96.52 KiB
  allocs estimate:  8
  --------------
  minimum time:     74.425 ms (0.00% GC)
  median time:      82.933 ms (0.00% GC)
  mean time:        82.446 ms (0.00% GC)
  maximum time:     99.489 ms (0.00% GC)
  --------------
  samples:          61
  evals/sample:     1

## Filter plink files

Filter a set of Plink files according to row indices and column indices.

In [64]:
BEDFiles.filter(BEDFiles.datadir("mouse"), 1:5, 1:5)

5×5 BEDFile:
 0x02  0x02  0x02  0x02  0x03
 0x02  0x02  0x03  0x02  0x02
 0x03  0x03  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x02
 0x03  0x03  0x03  0x03  0x03

In [65]:
rm(BEDFiles.datadir("mouse") * ".filtered.bed")
rm(BEDFiles.datadir("mouse") * ".filtered.fam")
rm(BEDFiles.datadir("mouse") * ".filtered.bim")

Filter a set of Plink files according to logical vectors.

In [66]:
BEDFiles.filter(BEDFiles.datadir("mouse"), rowmask, colmask)

1907×9997 BEDFile:
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x02  0x03  0x02  0x02  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x03  0x03  0x00  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02     0x03  0x03  0x03  0x00  0x00  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x03  0x02  0x02  0x02  0x03
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x00  0x00  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x03  0x03  0x00  0x03  0x00  0x00
 0x02  0x02  0x03  0x02  0x02  0x02     0x00  0x03  0x03  0x00  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02     0x03  0x00  0x03  0x00  0x00  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x03  0x02  0x02  0x02  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x00  0x03  0x00  0x03
 0x02  0x02  0x02  0x02  0x03  0x02  …  0x03  0x03  0x03  0x00  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x00  0x00  0x03  0x03  0x03
 0x02  0x02  0x02  0x02  0x03  0x02     0x03  0x00  0x03  0x00  0x03  0x03
    ⋮ 

In [67]:
rm(BEDFiles.datadir("mouse") * ".filtered.bed")
rm(BEDFiles.datadir("mouse") * ".filtered.fam")
rm(BEDFiles.datadir("mouse") * ".filtered.bim")